
COMPUTER AIDED MOLECULAR DESIGN: THEORY AND PRACTICE

COMPUTER-AIDED CHEMICAL ENGINEERING

Advisory Editor: R. Gani

Volume 1: Distillation Design in Practice (L.M. Rose)
Volume 2: The Art of Chemical Process Design (G.L. Wells and L.M. Rose)
Volume 3: Computer Programming Examples for Chemical Engineers (G. Ross)
Volume 4: Analysis and Synthesis of Chemical Process Systems (K. Hartmann and K. Kaplick)
Volume 5: Studies in Computer-Aided Modelling, Design and Operation. Part A: Unit Operations (I. Pallai and Z. Fonyó, Editors). Part B: Systems (I. Pallai and G.E. Veress, Editors)
Volume 6: Neural Networks for Chemical Engineers (A.B. Bulsari, Editor)
Volume 7: Material and Energy Balancing in the Process Industries - From Microscopic Balances to Large Plants (V.V. Veverka and F. Madron)
Volume 8: European Symposium on Computer Aided Process Engineering-10 (S. Pierucci, Editor)
Volume 9: European Symposium on Computer Aided Process Engineering-11 (R. Gani and S.B. Jorgensen, Editors)
Volume 10: European Symposium on Computer Aided Process Engineering-12 (J. Grievink and J. van Schijndel, Editors)
Volume 11: Software Architectures and Tools for Computer Aided Process Engineering (B. Braunschweig and R. Gani, Editors)
Volume 12: Computer Aided Molecular Design: Theory and Practice (L.E.K. Achenie, R. Gani and V. Venkatasubramanian, Editors)

COMPUTER-AIDED CHEMICAL ENGINEERING, 12

COMPUTER AIDED MOLECULAR DESIGN: THEORY AND PRACTICE Edited by

Luke E.K. Achenie, Computer Aided Process and Product Design Lab, Department of Chemical Engineering, University of Connecticut, 191 Auditorium Road, Storrs, CT 06269, USA

Rafiqul Gani, CAPEC, Technical University of Denmark, Department of Chemical Engineering, Building 229, DK-2800 Lyngby, Denmark

Venkat Venkatasubramanian, Laboratory of Intelligent Process Systems, School of Chemical Engineering, Purdue University, West Lafayette, IN 47907-1283, USA

2003 ELSEVIER

Amsterdam - Boston - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo

ELSEVIER SCIENCE B.V., Sara Burgerhartstraat 25, P.O. Box 211, 1000 AE Amsterdam, The Netherlands

© 2003 Elsevier Science B.V. All rights reserved.

This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use:

Photocopying: Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use.

Permissions may be sought directly from Elsevier Science via their homepage (http://www.elsevier.com) by selecting 'Customer support' and then 'Permissions'. Alternatively you can send an e-mail to: permissions@elsevier.com, or fax to: (+44) 1865 853333.

In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 207 631 5555; fax: (+44) 207 631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter.

Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Science Global Rights Department, at the fax and e-mail addresses noted above.

Notice: No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

First edition 2003

Library of Congress Cataloging in Publication Data A catalog record from the Library of Congress has been applied for.

British Library Cataloguing in Publication Data A catalogue record from the British Library has been applied for.

ISBN: 0-444-51283-7 ISSN: 1570-7946 (Series)

The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.

Preface

CAMD, or Computer Aided Molecular Design, refers to the design of molecules with desirable properties. That is, through CAMD, one determines molecules that match a specified set of (target) properties. As a technique, CAMD has very large potential since, in principle, all kinds of chemical, biochemical and material products can be designed with it. It has become a mature technique that is attracting more and more researchers and finding increasing industrial application. The limitation, at this moment, is the ability to estimate the target properties of the desired molecule.

The book mainly deals with macroscopic properties and therefore, does not cover molecular design of large, complex chemicals such as drugs. The methodology presented, however, would be applicable for such problems provided the higher level molecular structural representation is integrated with appropriate molecular structure-property relationships. While books have been written on computer aided molecular design related to drugs and large complex chemicals, a book on systematic formulation of CAMD problems and solutions with emphasis on theory and practice which would help one to learn, understand and apply the technique is currently unavailable.

With this book, we have tried to put together the theoretical aspects related to CAMD, the different techniques that have been developed and the different applications that have been reported. We have highlighted the applications through case studies.

We have grouped the chapters of this book into 3 parts - Part I: Theory, Methods & Tools; Part II: Applications & Practice of CAMD; and Part III: New Frontiers. Problem formulation and solution techniques are covered in Part I by chapters 1-7. Applications and practice of CAMD in different types of problems are highlighted in chapters 8-15 of Part II together with descriptions of case study problems and their solution. Each case study highlights the application of specific CAMD techniques. Part III contains one single chapter (16) where we highlight the new frontiers (in our view) and the future of CAMD.

We have targeted a mixed audience in this book. Specifically, we have designed the book for scientists and engineers from industry who would like to apply CAMD to solve their specific problems of interest. It is also designed for educators from academia who would like to use it for teaching as part of process/product design courses (including such courses as separation processes). The book would be of interest to scientists and engineers who would like to learn more about CAMD in addition to CAMD problem solutions. Finally, this book is intended for those who would like to use it as the starting point to further develop and extend the state of the art in CAMD.

We would like to thank all the contributing authors for their manuscripts and for agreeing to make the necessary changes to accommodate the content, format and style of this book. The contributing authors to the various chapters of this book come from academia as well as industry. They are among the leading researchers, developers and users of CAMD. We hope the book will serve to promote further development of CAMD and further interest from the industry to apply CAMD.

We thank the reviewers for their valuable comments and suggestions. We thank Elsevier for their interest in this subject and for publishing this book. We acknowledge the support, help and contribution of Prasanjeet Ghosh, Santhoji Katare, Mette Dinsen and all our previous students and coworkers who have contributed to the development of CAMD in general and preparation of this book in particular. We also thank all the companies who have shown interest in CAMD and supported our research in this area.

We hope the readers of this book will find it an invaluable resource in their research, development and educational activities. We also hope that the book will generate enough interest and valuable feedback for future editions.

Luke E. K. Achenie, Rafiqul Gani & Venkat Venkatasubramanian

List of contributors

Author: Address

L. E. K. Achenie: University of Connecticut, Department of Chemical Engineering, 191 Auditorium Road, Storrs, CT 06269, USA.

C. S. Adjiman: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.

A. Apostolakou: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.

E. A. Brignole: Planta Piloto de Ingenieria Quimica-PLAPIQUI (UNS-CONICET), Camino La Carrindanga Km 7, 8000, Bahia Blanca, Argentina.

A. Buxton: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.

J. M. Caruthers: Laboratory for Intelligent Process Systems, School of Chemical Engineering, Purdue University, West Lafayette, IN 47907, USA.

M. Cismondi: Planta Piloto de Ingenieria Quimica-PLAPIQUI (UNS-CONICET), Camino La Carrindanga Km 7, 8000, Bahia Blanca, Argentina.

J. L. Cordiner: Syngenta, Global Specialist Technology, Grangemouth Manufacturing Centre, Earls Road, Grangemouth, Stirlingshire, FK3 8XG, United Kingdom.

R. Gani: CAPEC, Technical University of Denmark, Department of Chemical Engineering, Building 229, DK-2800 Lyngby, Denmark.

P. M. Harper: Integrated Process Solutions ApS, Solvgade 14B, 1307 Copenhagen K, Denmark.

M. Hostrup: Integrated Process Solutions ApS, Solvgade 14B, 1307 Copenhagen K, Denmark.

A. Hugo: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.

A. G. Livingston: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.

G. M. Ostrovski: University of Connecticut, Department of Chemical Engineering, 191 Auditorium Road, Storrs, CT 06269, USA.

P. Patkar: Laboratory for Intelligent Process Systems, School of Chemical Engineering, Purdue University, West Lafayette, IN 47907, USA.

E. N. Pistikopoulos: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.

M. Sinha: Global Alternative Propulsion Center, General Motors, Honeoye Falls, NY 14472, USA.

A. Sundaram: ExxonMobil Process Research, Paulsboro, NJ 08066, USA.

V. Venkatasubramanian: Laboratory for Intelligent Process Systems, School of Chemical Engineering, Purdue University, West Lafayette, IN 47907, USA.

J. M. Vinson: Pharmacia Corporation, 5200 Old Orchard Rd., Skokie, IL 60077, USA.

Contents

Preface

List of contributors

PART I: Theory, Methods & Tools

1. Introduction to CAMD
   R. Gani, L. E. K. Achenie and V. Venkatasubramanian
2. Molecular Design - Generation & Test Methods
   E. A. Brignole and M. Cismondi
3. Optimization Methods in CAMD - I
   M. Sinha, L. E. K. Achenie and G. M. Ostrovski
4. Optimization Methods in CAMD - II
   A. Apostolakou and C. S. Adjiman
5. Genetic Algorithms Based CAMD
   P. R. Patkar and V. Venkatasubramanian
6. A Hybrid CAMD Method
   P. M. Harper, M. Hostrup and R. Gani
7. Identification of Multistep Reaction Stoichiometries: CAMD Problem Formulation
   A. Buxton, A. Hugo, A. G. Livingston and E. N. Pistikopoulos

PART II: Applications of CAMD

8. CAMD for Solvent Selection in Industry - I
   J. M. Vinson
9. CAMD for Solvent Selection in Industry - II
   J. L. Cordiner
10. Case Study in Optimal Solvent Design
    M. Sinha, L. E. K. Achenie and G. M. Ostrovski
11. CAMD in Solvent Mixture Design
    M. Sinha and L. E. K. Achenie
12. Refrigerant Design Case Study
    A. Apostolakou and C. S. Adjiman
13. Polymer Design Case Study
    P. R. Patkar and V. Venkatasubramanian
14. Case Study in Identification of Multistep Reaction Stoichiometries
    A. Buxton, A. Hugo, A. G. Livingston and E. N. Pistikopoulos
15. Molecular Design of Fuel Additives
    A. Sundaram, V. Venkatasubramanian and J. M. Caruthers

PART III: Computer Aided Product Design

16. Challenges and Opportunities for CAMD
    R. Gani, L. E. K. Achenie and V. Venkatasubramanian

Glossary of Terms

Subject Index

Author Index

Part I: Theory, Methods & Tools

This part of the book covers problem formulation and solution techniques. The first chapter introduces the computer aided molecular design (CAMD) problem and discusses its important issues. Then chapters 2 to 7 deal with some of the common techniques used to tackle various types of CAMD problems. Specifically, the second chapter discusses methods based on a generate-and-test approach, followed by two chapters on optimization methods involving mathematical programming. Evolutionary techniques based on genetic algorithms are presented next in chapter 5 while chapter 6 describes a hybrid CAMD method. Finally, the first part of the book concludes with chapter 7 where CAMD in identification of multistep reaction stoichiometries is presented.


Computer Aided Molecular Design: Theory and Practice. L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors). © 2003 Elsevier Science B.V. All rights reserved.

Chapter 1: Introduction to CAMD

R. Gani, L.E.K. Achenie & V. Venkatasubramanian

In (chemical) product design, we try to find a (chemical) product that exhibits certain desirable or specified behaviour. In another type of (chemical) product design, we try to find an additive that, when added to another chemical or non-chemical product, enhances its (desirable) functional properties. This type of product is commonly known as a formulation. That is, in (chemical) product design, we do not know the identity of the final product but we have some idea of how we want it to behave, and the problem is to find the most appropriate chemical(s) that will exhibit and/or cause the desired behaviour. Once we have identified the product, and have tested it, we need to determine if it can also be manufactured. That is, we need to design a (chemical) process through which we can manufacture the desired product with profit, increased operational efficiency and positive environmental, health and safety impact. Before we can do this, however, we also need to determine the likely raw materials (which could also be other chemical products) that can be processed in order to manufacture the desired product. That is, we extend the problem boundary of process design at the start by determining the product that we would like to manufacture and at the end in order to analyse the effect of the product and its manufacture on the environment.

1.1 WHAT IS CAMD?

The design process for a chemical product involves a number of steps through which scientific principles may be applied for the solution of the specified design problem. Cussler and Moggridge (2001) suggest four principal steps in their design process:

1. Define needs;
2. Generate ideas to meet needs;
3. Select among ideas;
4. Manufacture product.

As illustrated in Fig. 1, the 2nd and 3rd steps, considered together, represent two types of design problems, namely Molecular Design and Mixture/Blend Design. The 1st step may be considered a pre-design or problem formulation step, while the last step may be considered part of a process design problem. The molecular and mixture/blend design problems can be solved independently of the process design problem or as an integrated product-process design problem.

[Figure 1 (flowchart; box labels only): Pre-Design ("define needs & goals"); Product Design, containing CAMD ("generate & select alternatives") and CAMbD ("generate & select alternatives"); Process Design ("manufacture & test product"); Process-Product Design.]

Figure 1: Steps of the design process related to product design.

For the solution of the molecular and mixture/blend design problems, various approaches can be applied, ranging from empirical trial and error to mathematical programming to hybrid methods. The applicability of a particular solution technique depends, to a large extent, on the approach used to determine the target behaviour (properties) of the desired products. If appropriate property models do not exist, an empirical trial and error approach based on experimentation, although not the most efficient, is usually the only option. If property models are available, computer aided methodologies become viable alternatives. That is, through the use of property models as part of a computer aided methodology, the molecular design problem is transformed into a computer aided molecular design (CAMD) problem, while the mixture/blend design problem is transformed into a computer aided mixture design (CAMbD) problem. CAMD and CAMbD together may be called computer aided product design (CAPD). Unless specifically mentioned, in this book, the term CAMD will be used for molecular design as well as mixture/blend design. Likewise, the term product will be used to include single molecules as well as mixtures.

1.1.1 Problem Definition

Computer aided molecular design problems are defined as

Given a set of building blocks and a specified set of target properties, determine the molecule or molecular structure that matches these properties.

In this respect, it is the reverse of the property prediction problem, where, given the identity of the molecule and/or the molecular structure, a set of target properties is calculated. CAMD may be performed at various levels of size and complexity of molecular structure representation. For example, the design of solvents, refrigerants, etc., is usually based on properties estimated from macroscopic structural information. In the design of structured products such as polymers, drugs, pesticides, food additives, etc., the structural differences are observed by employing meso- and/or microscopic representations of the molecular structure. Therefore, the property models and the molecular structural representation differ according to the type of molecules being designed.

Computer aided mixture/blend design problems can be defined as,

Given a set of chemicals and a specified set of property constraints, determine the optimal mixture and/or blend.

Here, we do not know which chemicals to use in the product or in what amounts they should be present, but we know the molecular structures of the candidate chemicals. Formulated products and blends are typical examples of mixture design. In the first case, a formulation (representing a mixture or blend) is added to a product in order to enhance one or more specified properties of the original product. For example, a specified property (such as the viscosity of a product) needs to increase by an order of magnitude when the formulation (also known as an ingredient or additive) is added. In other cases, a mixture or blend having a specified set of target properties is itself the desired product, as in polymer blends, petroleum blends, solvent blends, edible oil blends and many more.

The fundamental objective of CAMD, therefore, is to identify a compound or a collection of compounds having specific (desired) properties. The structures of the compounds (molecules) are represented using appropriate descriptors together with an algorithm that identifies these descriptors. This means that the property evaluation methods should be based on these descriptors as well.

The most common approach in CAMD is to generate chemically feasible molecular structures from a set of descriptors (represented by fragments or building blocks) and to test them by estimating their desired (specified) properties. The properties are estimated using some kind of fragment-based methodology, where the contributions of each fragment present in the molecule to a specific property are added to determine the property value of the compound. The feasible compounds are those that match the property specifications, given as a series of property constraints. The optimal compound is identified from the set of feasible compounds through a problem-specific selection criterion or objective function. The principal differences between the various CAMD methodologies lie in how the various steps are performed, the type of descriptors used and how the necessary property values are obtained.
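The generate-and-test loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three building blocks, their free valences as used here, and especially the contribution values are hypothetical placeholders rather than fitted group contributions, and the feasibility test covers only acyclic molecules (the free valences of n groups must sum to 2(n - 1)).

```python
from itertools import combinations_with_replacement

# Hypothetical building blocks: name -> (free valences, illustrative contribution).
# The contribution values are placeholders, NOT fitted group-contribution data.
GROUPS = {"CH3": (1, 1.0), "CH2": (2, 0.5), "OH": (1, 3.0)}


def is_feasible_structure(groups):
    """Acyclic feasibility rule: n groups joined by n - 1 bonds use 2(n - 1) valences."""
    n = len(groups)
    total_valence = sum(GROUPS[g][0] for g in groups)
    return n >= 2 and total_valence == 2 * (n - 1)


def estimate_property(groups):
    """Additive estimate: sum the contribution of every group occurrence."""
    return sum(GROUPS[g][1] for g in groups)


def generate_and_test(max_groups, lower, upper):
    """Enumerate group combinations, keep feasible ones that satisfy the bounds."""
    feasible = []
    for n in range(2, max_groups + 1):
        for combo in combinations_with_replacement(sorted(GROUPS), n):
            if not is_feasible_structure(combo):
                continue  # generation step rejects chemically infeasible sets
            value = estimate_property(combo)
            if lower <= value <= upper:  # test step: property constraint
                feasible.append((combo, value))
    return feasible


candidates = generate_and_test(max_groups=3, lower=3.0, upper=5.0)
```

With these placeholder numbers the call returns two candidate group sets. A real implementation would use many more groups, additional feasibility rules, and validated contribution tables.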

1.1.2 Formulation of Property Constraints

The formulation of the property constraints is a prerequisite for solving any CAMD problem. A set of properties is selected as constraints with some combination of specified goal values and lower and upper bounds. These represent explicit property constraints because their values can be determined directly through a model or measured experimentally. There are, however, desired properties involving products such as food, fragrances, health & safety, etc., that may need to be formulated implicitly. That is, they cannot be measured or predicted by a model directly but may be inferred through databases, past knowledge, other measured or predicted properties and so on. For example, the taste of a food product, the aroma of fragrances, the health hazards of chemicals, etc., fall under implicit property constraints. Environmental considerations can be formulated implicitly or explicitly. Explicit considerations relate physical properties to environmental considerations (e.g. ozone depletion potential) while implicit considerations are realized in the selection of the types of compounds considered in the search/design phase (e.g. the exclusion of aromatic compounds). The following questions help to define the constraints; note that they are not the only questions needed to define the problem completely.

What function is the desired product supposed to perform?

These functions could be related only to the use of the product on a standalone basis or, they could be included as part of some greater functionality that the product may be asked to provide in conjunction with other materials. Examples of the former are a solvent, a refrigerant, and a polymer while examples of the latter are a solvent blend added to a paint, an ingredient added to a food product to make it fat-free, and an ingredient added to a drug to inhibit a specific biological function.

Is the product a replacement for another product?

If yes the designed product should do some combination of the following (a) match a set of properties, (b) match or surpass a set of properties of the original product and (c) avoid a third set of properties. This can be the replacement of one synthesized chemical product with another as well as replacement of a natural product with a synthesized one (for example, synthetic rubber).

Are there any operational limits (temperature, pressure and phase) for the desired product? If yes, what are these?

The operational limits help define the upper and lower limit of the constraints on the phase and the phase transition related properties.

What criteria should be used to evaluate the performance of the desired product?

The performance criteria are related to the function of the desired product in the process operation for which it is designed, which helps to define the objective function for optimization based CAMD. For example, for a solvent in solvent based separations, these criteria often degenerate into bound constraints: usually lower bounds on selectivity, lower bounds on distribution coefficient, upper bounds on solvent loss and many more. In the case of formulations, the ingredient needs to be tested for the enhanced performance of the original product, such as controlled release, improved inhibition, etc., of drugs. Models for evaluating performance, however, may not be easy to develop and are likely to be very complex.

Are there any downstream processing considerations?

The role of the designed product in downstream processing, such as solvent recovery, wastewater treatment and disposal, needs to be considered. They may be included as direct property constraints, if feasible. However, since they depend on the process, alternatively, the product and process design problems may be integrated to handle these constraints together with other process design issues.

The following mathematical programming problem provides a generic representation of most CAMD problems.

\begin{align}
& F_{OBJ} = \max \{ C^{T} y + f(x) \} \tag{1} \\
& \text{s.t.} \notag \\
& h_1(x) = 0 && \text{process design specs} \tag{2} \\
& h_2(x) = 0 && \text{process model equations} \tag{3} \\
& h_3(x) = 0 && \text{CAMD specifications} \tag{4} \\
& l_1 \le g_1(x) \le u_1 && \text{process design constraints} \tag{5} \\
& l_2 \le g_2(x) \le u_2 && \text{CAMD constraints} \tag{6} \\
& l_3 \le B y + C x \le u_3 && \text{logical constraints} \tag{7}
\end{align}

In the above equations, x represents the vector of continuous variables (such as flowrates, mixture compositions, condition of operation, design variables, etc.), y represents the vector of binary integer variables (such as unit operation identity, descriptor identity, compound identity, etc.), h1(x) represents the set of equality constraints related to process design specifications (such as reflux ratio, operation pressure, heat addition, etc.), h2(x) represents the set of equality constraints related to the process model equations (i.e., mass and energy balance equations), h3(x) represents the set of equality constraints related to CAMD (such as chemical feasibility rules, mixing rules for properties, etc.), g1(x) represents the set of inequality constraints (process design specifications) and g2(x) represents the set of inequality constraints with respect to environmental constraints and property constraints related to CAMD design. The binary variables typically appear linearly as they are included in the objective function term and in the constraints (Eq. 7) to enforce logical conditions. The term f(x) represents a vector of objective functions that may be linear or non-linear depending on the definition of the optimization problem. For process optimisation, f(x) is usually a non-linear function while for integrated approaches, f(x) usually consists of more than one non-linear function.

Many variations of the above mathematical formulation may be derived to represent different CAMD problems and methodologies. Some examples are given below.

i) Satisfy only constraint 6. This represents a CAMD problem for which a database search is adequate as a solution methodology.

ii) Ignore the objective function and the constraints represented by Eqs. 2, 3, 5 and 7 and only satisfy constraints 4 and 6. This is a CAMD problem that generates a feasible set of candidates.

iii) Solve a mathematical programming problem that includes Eqs. 1, 4 and 6. This is optimal design of the molecule and/or mixture.

iv) Only satisfy the constraints 2-7. This generates a feasible set of candidates (products and their corresponding process).

v) Solve all the equations. This represents an integrated process-product design problem.
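As a rough illustration of the simplest variation, where only the CAMD property bounds of Eq. 6 are checked against stored data, the following Python sketch filters a small in-memory table. The compound names, property names and numerical values are hypothetical placeholders and not measured data; the function name `database_search` is likewise an assumption made for this example.

```python
# Hypothetical database; property values are placeholders, not measured data.
DATABASE = {
    "compound_A": {"boiling_point_K": 350.0, "solubility_param": 18.0},
    "compound_B": {"boiling_point_K": 420.0, "solubility_param": 19.5},
    "compound_C": {"boiling_point_K": 365.0, "solubility_param": 24.0},
}

# Bounds (l2, u2) for each constrained property, in the spirit of Eq. 6.
BOUNDS = {
    "boiling_point_K": (340.0, 400.0),
    "solubility_param": (17.0, 20.0),
}


def database_search(database, bounds):
    """Return the names of compounds with l2 <= g2 <= u2 for every bounded property."""
    hits = []
    for name, props in database.items():
        if all(lo <= props[p] <= hi for p, (lo, hi) in bounds.items()):
            hits.append(name)
    return hits


matches = database_search(DATABASE, BOUNDS)
```

The sketch assumes every compound carries a value for every bounded property; a missing entry would raise a `KeyError`, which in practice is where a property model would have to fill the gap.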

Note that for all problem formulations, properties need to be supplied (measured or retrieved from a database) and/or predicted through models. Problems that include Eq. 3 also have property models included as a set of constitutive models that relate the properties to the intensive variables (pressure, temperature and composition). All problem formulations may use property models and, therefore, the application range of a CAMD methodology depends on the application range of the property models used.

Note that in problem formulations i-ii, an optimal design may be obtained by ordering all the feasible candidates according to the objective function (Eq. 1) value. Global optimality, however, can be guaranteed only if all possible compounds were considered in the generation of the feasible set of candidates. On the other hand, problem formulations iii-v may become too complex to solve if the property model is highly non-linear and discontinuous. Also, the solution approach may not be able to accommodate multiple property models for the same property. Thus, while these problem formulations can determine the optimal design, their application range is usually quite small.
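Ordering an enumerated feasible set by objective value, as described for formulations i-ii, amounts to a sort. The sketch below assumes a maximization objective as in Eq. 1; the candidate labels and objective values are hypothetical placeholders.

```python
def rank_by_objective(feasible, objective):
    """Sort feasible candidates from best to worst objective value (maximization)."""
    return sorted(feasible, key=objective, reverse=True)


# Hypothetical feasible set: (candidate label, objective value).
feasible = [("cand_1", 4.0), ("cand_2", 4.5), ("cand_3", 3.2)]

ranked = rank_by_objective(feasible, objective=lambda cand: cand[1])
best = ranked[0]
```

The first entry is the best candidate only among those that were enumerated, which is the caveat on global optimality made in the text.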

Having formulated the property constraints and a version of the generic problem formulation, the next step is to select the property models and/or means to provide the necessary property values.

1.1.3 Prediction of Properties

The success of CAMD methodologies depends, to a large extent, on the ability to predict and/or obtain the necessary pure component and mixture properties, or more generally, performance characteristics, included in the property constraints and in the process model. Even if the CAMD problem involves the design of a single molecule, mixture properties may need to be calculated. For example, in solvent design, the property constraints may include pure component properties such as boiling point and heat of vaporization, and mixture properties such as solubility of solute and solvent loss. In CAMbD problems, the property constraints are all mixture property based; however, the models for these mixture properties may require pure component properties. Consequently, the pure component properties may be used to screen out some of the candidate molecules to be considered in the mixture design problem.

A wide range of property models can be found (Poling et al. 2000). The main question is: which model has the largest reliable application range for the descriptors used to represent the molecular structures? For instance, if the descriptors employed for molecular structural representation are able to identify differences in isomer structures, then the property model must also be able to predict the property differences (if any) of these isomers. Otherwise, all isomers would be selected as feasible.

Gani and Constantinou (1995) proposed a classification of properties as primary (pure component properties that can be determined only from the molecular structural variables, for example critical properties, normal boiling point, normal melting point, heat of vaporization at 298 K, heat of fusion at 298 K, etc.), secondary (pure component properties that are dependent on other properties, for example surface tension, viscosity, solubility parameter, vapor pressure at a given temperature, density at a given temperature, etc.) and functional (pure component properties dependent on temperature and/or pressure, for example density, vapor pressure, enthalpy, heat of vaporization, etc., as a function of temperature; and mixture properties that are dependent on composition and/or temperature and pressure, for example liquid phase activities, vapor phase fugacities, phase density, mixture viscosity, mixture saturation temperature, etc.).

For several material design applications of interest, the desired properties are even more complex, high-level performance characteristics that are to be satisfied by the material during its active service life. These performance measures are usually very difficult to predict using standard property-prediction models. Sophisticated models, usually hybrids of different approaches, need to be constructed. Examples of such systems or properties include reaction systems (i.e. where the final desired performance may come into play only at the end of chemical or biological reactions), long-term mechanical properties, biological functionalities, etc. Further, several of these performance measures are dynamic, i.e., time-evolving. In such cases, not only is the value of a particular high-level property at the start of the active service life of the material important, but also, and usually more critically, its evolution profile throughout the period of service.

Gani and Constantinou (1995) also propose a classification of property models that may be employed for each class of properties. Figure 2 highlights this classification.

[Figure 2 (diagram): Classification of Estimation Methods. Reference / Mechanical models: Quantum Mechanics, Molecular Mechanics, Molecular Simulation. Semi-empirical models: Corresponding States Theory, Topology / Geometry, Group / Atom / Bond additivity. Empirical models: Chemometrics, Pattern matching, Factor analysis, QSAR.]

Figure 2: Classification of property estimation methods

Estimation of primary pure component properties

While there are numerous property estimation methods for primary pure component properties, not all of them are applicable in CAMD. Most property estimation methods used in CAMD methodologies are based on the Group Contribution Approach (GCA) (Franklin, 1949), where the properties of a compound are expressed as functions of the number of occurrences of predefined fragments (groups) in the molecule. The GCA-based methods belong to a class known as additive methods, which have the general form:

$$F(p) = w_1 \sum_i N_i C_i + w_2 \sum_j M_j D_j + w_3 \sum_k O_k E_k + \dots \qquad (8)$$

In the above equation, $C_i$ is the contribution of atom, bond or first-order group i; $N_i$ is the number of occurrences of atom, bond or first-order group i; $D_j$ is the contribution of atom, bond or second-order group j; $M_j$ is the number of occurrences of atom, bond or second-order group j; $E_k$ is the contribution of atom, bond or third-order group k; $O_k$ is the number of occurrences of atom, bond or third-order group k; and $w_1$, $w_2$, $w_3$ are weights that may be imposed on each of the additive terms. With this method, if the fragments (atoms, bonds, groups, etc.) representing each molecule are identified and their contributions to a needed property are available, then the corresponding property of the molecule can be estimated by simply summing all the contributions. Since the same fragments can be used to represent different molecules, these property estimation methods, although semi-empirical in nature, are also truly predictive. Note that methods of this type consider only the number of occurrences of the atoms and bonds, not their placement. The limitations of these methods are their accuracy and their ability to handle complex molecular structures. In principle, these methods can be made highly accurate, with a large application range, by simply adding more additive terms of higher order. From a practical point of view, however, this is not feasible, and the highest order used in methods of this type is three (Marrero and Gani, 2001). Second- and third-order additive methods are able to distinguish some isomeric molecular structures.
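As a minimal sketch of how Eq. 8 is evaluated, the following Python fragment sums weighted contributions for up to three orders of groups. The group names, the contribution values and the function name `additive_estimate` are hypothetical and chosen only for illustration; they are not taken from any published parameter table. The result is the value of F(p), so recovering p itself requires inverting whatever function F a particular method defines.

```python
from typing import Dict, Sequence


def additive_estimate(
    counts_by_order: Sequence[Dict[str, int]],
    contribs_by_order: Sequence[Dict[str, float]],
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Evaluate F(p) as a weighted sum of occurrence * contribution per order."""
    total = 0.0
    for weight, counts, contribs in zip(weights, counts_by_order, contribs_by_order):
        # One additive term of Eq. 8: w * sum(N_i * C_i)
        total += weight * sum(n * contribs[group] for group, n in counts.items())
    return total


# Hypothetical contributions (illustrative numbers only).
first_order_contribs = {"CH3": 1.0, "CH2": 0.5, "CH2COO": 3.0}
second_order_contribs = {"G2_example": 0.25}
third_order_contribs: Dict[str, float] = {}

# Group occurrences: first-order counts follow the group vector 2 CH3, 1 CH2, 1 CH2COO.
first_order_counts = {"CH3": 2, "CH2": 1, "CH2COO": 1}
second_order_counts = {"G2_example": 1}
third_order_counts: Dict[str, int] = {}

f_value = additive_estimate(
    [first_order_counts, second_order_counts, third_order_counts],
    [first_order_contribs, second_order_contribs, third_order_contribs],
)
print(f_value)
```

Because only the counts enter the sum, two isomers with the same first-order group vector receive the same first-order term; any difference between them has to come from the higher-order terms.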

Methods based on topological or geometric information provide a higher level of molecular representation. The methods based on topological information related to the molecular structure commonly employ the well-known connectivity index (Kier and Hall, 1986; Bicerano, 1993), while methods based on geometric information employ conjugates (Constantinou et al. 1994). Connectivity indices specify the spatial arrangement of the atoms in the molecule, while conjugation (with respect to molecular structures) refers to an idealized arrangement of atoms connected by bonds (Constantinou et al. 1994). Any property p is estimated through Eq. 9 (connectivity index) or Eq. 10 (conjugation).

$$F(p) = a\,\chi^0 + b\,\chi^1 + c\,\chi^2 + d\,\chi^3 + \dots \qquad (9)$$

$$F(p) = \sum_i N_i B_i + \sum_j M_j E_j \qquad (10)$$

In Eq. 9, $\chi^n$ is the connectivity index of order n, and a, b, c and d are the adjustable parameters. In Eq. 10, $B_i$ is the contribution of bond i; $N_i$ is the number of occurrences of bond i; $E_j$ is the contribution of bond j; and $M_j$ is the number of occurrences of bond j. The main computational effort is spent on generating the connectivity indices or conjugates representing a molecular structure. Once these are known, the property estimation phase is simple and computationally inexpensive. Like the additive methods, these methods are also predictive. Another advantage of these methods is that the indices and/or conjugates may be used to generate the fragments for the additive methods. In this way, they use more structural information than the additive methods and are therefore able to distinguish more isomeric structures. The main difficulty is to know how many indices should be used and how to estimate their property contributions.
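To illustrate how a connectivity index separates isomers that share the same atoms, the sketch below computes the simple (non-valence) zero- and first-order connectivity indices from a hydrogen-suppressed bond list, using the standard definitions: the zero-order index sums the inverse square root of each atom's degree, and the first-order index sums the inverse square root of the degree product over each bond. The function names and the atom numbering are assumptions made for this example.

```python
from typing import List, Tuple

Bond = Tuple[int, int]


def atom_degrees(bonds: List[Bond], n_atoms: int) -> List[int]:
    """Number of non-hydrogen neighbours of each atom."""
    degree = [0] * n_atoms
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return degree


def chi0(bonds: List[Bond], n_atoms: int) -> float:
    """Zero-order index: sum over atoms of degree**(-1/2)."""
    return sum(d ** -0.5 for d in atom_degrees(bonds, n_atoms))


def chi1(bonds: List[Bond], n_atoms: int) -> float:
    """First-order index: sum over bonds of (degree_i * degree_j)**(-1/2)."""
    degree = atom_degrees(bonds, n_atoms)
    return sum((degree[i] * degree[j]) ** -0.5 for i, j in bonds)


# Hydrogen-suppressed carbon skeletons of two C4H10 isomers.
n_butane = [(0, 1), (1, 2), (2, 3)]   # linear chain
isobutane = [(0, 1), (0, 2), (0, 3)]  # one central carbon with three neighbours

print(chi1(n_butane, 4), chi1(isobutane, 4))
```

The two skeletons contain the same number of carbon atoms and bonds, yet their first-order indices differ, which is the kind of structural distinction that a count-based first-order group vector may not capture.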

The topological information based methods are also classified under QSPR (Quantitative Structure Property Relationship) or QSAR (Quantitative Structure Activity Relationship) methods. Many QSPR and QSAR methods base the prediction of properties on the structure of the molecule using complex descriptors obtained from molecular modeling. CAMD methodologies dealing with meso- and microscopic representation of the molecular structures employ such descriptors to identify the differences in the molecular structures as well as to estimate the needed properties. While these property models are able to employ complex descriptors and to distinguish between isomeric structures, their application range outside the training set of molecules may be questionable. Therefore, they are more suitable for use in CAMD problem formulations of types i & ii, although they are able to handle large, complex molecules. More details on QSPR and QSAR methods can be found in Kier and Hall (1986) and Livingstone (2001).

Estimation of secondary pure component properties

The best source of methods for this type of property is the book by Poling et al. (2000), which gives a comprehensive overview of the properties and the corresponding property models that may be used. These methods are therefore not covered in this book. It should be noted, however, that many of the secondary properties that are calculated from primary properties might also be converted to primary properties. For example, the Hansen solubility parameters are estimated from known values of molar volumes and heats of vaporization at 298 K. The solubility parameter data can therefore also be correlated through a set of groups or topological indices to generate a primary property model. In a similar way, properties such as octanol-water partition coefficients and water solubilities may also be converted to primary pure component properties. Since the primary pure component properties are functions only of the molecular structural variables, they are very useful in CAMD problem solution.
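As an illustration of a secondary property computed from two primary ones, the sketch below evaluates the total (Hildebrand) solubility parameter, delta = sqrt((dHvap - R T) / Vm), from a heat of vaporization and a molar volume. Using this particular expression is an assumption: the text above only states which inputs are needed, not the formula. The input values in the example are hypothetical round numbers, not data for any specific compound.

```python
import math

R = 8.314462618  # universal gas constant, J/(mol K)


def solubility_parameter(dh_vap: float, molar_volume: float,
                         temperature: float = 298.15) -> float:
    """Total solubility parameter in MPa**0.5.

    dh_vap       heat of vaporization in J/mol
    molar_volume liquid molar volume in cm3/mol
    temperature  temperature in K
    """
    cohesive_energy = dh_vap - R * temperature  # J/mol
    if cohesive_energy <= 0.0 or molar_volume <= 0.0:
        raise ValueError("inputs must give a positive cohesive energy density")
    # J/cm3 is numerically equal to MPa, so the square root is in MPa**0.5.
    return math.sqrt(cohesive_energy / molar_volume)


# Hypothetical inputs: 40 kJ/mol and 100 cm3/mol.
delta = solubility_parameter(40_000.0, 100.0)
print(round(delta, 2))
```

If both inputs are themselves estimated from groups, the error of each estimate propagates into the solubility parameter, which is one reason for correlating the secondary property directly against groups as described above.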

Estimation of mixture properties

The simplest and easiest, but usually the least accurate way, is to assume mixture ideality and employ a simple linear mixing rule.

$$F(\theta) = \sum_i x_i \, p_i \qquad (11)$$

In the above equation, $F(\theta)$ is a property function for mixture property $\theta$; $x_i$ is the composition of component i and $p_i$ is the corresponding pure component property for component i. If the assumption of mixture ideality is valid, this method is fast, easy and very convenient for use in CAMD problem formulations of types iii-v. Most practical mixtures, however, do not behave ideally, and more rigorous models are therefore needed. Since CAMD methodologies generate molecular structures and therefore work with molecular structural parameters, models that do not employ such parameters are not suitable. Examples of these models are NRTL (Renon and Prausnitz, 1968) and Wilson (Wilson, 1964), which need compound-specific, predetermined molecular interaction parameters for the estimation of liquid phase activity coefficients.
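The linear mixing rule of Eq. 11 can be sketched in a few lines. The function name and the example numbers are hypothetical; the only check added beyond the equation itself is that the compositions sum to one.

```python
from typing import Sequence


def ideal_mixture_property(x: Sequence[float], p: Sequence[float],
                           tol: float = 1e-9) -> float:
    """Composition-weighted sum of pure component property values."""
    if len(x) != len(p):
        raise ValueError("x and p must have the same length")
    if abs(sum(x) - 1.0) > tol:
        raise ValueError("compositions must sum to 1")
    return sum(xi * pi for xi, pi in zip(x, p))


# Hypothetical binary mixture: 25 % of a component with p = 100, 75 % with p = 200.
mixture_value = ideal_mixture_property([0.25, 0.75], [100.0, 200.0])
print(mixture_value)
```

No interaction between the components enters this calculation, which is why the rule is fast but generally the least accurate option.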

The most widely used mixture properties in CAMD applications are the liquid phase activity coefficients, because they may be used for estimating solubility (solid, liquid or gas), phase equilibrium (considering the other phase in equilibrium with the liquid to be ideal), liquid surface tension, liquid viscosity, bulk properties such as saturation temperatures and pressures, and many more. GCA-based methods are the only practical choice in this case, since the topological information based methods have not been developed for general purpose use and molecular modeling based methods are too complex for use in CAMD problem formulations of types ii-v.

The GCA-based method for prediction of liquid phase activity coefficients that is most widely used in CAMD methodologies is the UNIFAC method (Fredenslund et al., 1977), in its original form or in one of its various modifications. A major limitation of the UNIFAC method with its original set of first-order groups is that it cannot handle complex mixture non-ideality (such as proximity effects) and cannot distinguish between isomers. Some of these limitations have been addressed recently through the introduction of second-order groups (Kang et al. 2002). Another important limitation of UNIFAC and all other GCA-based mixture property models is that the necessary group interaction parameters may not be available for the generated feasible candidate molecules. Molecular modeling can help in this respect to predict the necessary group interactions (Jonsdottir et al. 1994).

For CAMD involving large, complex molecules and mixture properties, problem formulations of type i-ii are feasible options as they allow the use of sequential generation of feasible candidate molecules and testing of candidates. In this case, any number of property models may be used. While this is not a computationally efficient procedure, it is able to provide a means to identify promising candidates, at least, as a first step of the search.


Estimation of environmental, implicit and high-level properties

Environmental and other implicit properties need special attention since they do not usually belong to the standard databases for properties of chemical compounds. For the estimation of environmental properties, such as toxicity, biodegradability, ozone depletion potential, biological oxygen demand, global warming potential and soil adsorption potential, very few general methods covering a wide range of compounds have been developed, although new methods are continuously being developed (Martin and Young 2001). However, a number of methods valid for specific molecular types, such as alcohols, acids and benzene derivatives, are available (Lyman et al., 1990). These methods are capable of predicting many of the environmental properties listed above. Often, methods for environmental properties rely on the octanol/water partition coefficient (log P) as a known property value. Databases such as CHRIS (Silver Platter Information Inc., 1998a), HSDB (Silver Platter Information Inc., 1998b) and RTECS (Silver Platter Information Inc., 1998c) store environmental data and properties for a large number of substances.

The more difficult properties are high-level performance characteristics desired of the material. Examples of these include properties related to the taste of food products, the aroma of fragrances, long-term mechanical properties of polymers and polymer blends, and many more. What often makes the modeling process even more challenging is that several of these properties of interest are dynamic, and the design objectives are specified in terms of the time-evolution profile of the property in question throughout the service time of the material. Some of these may be estimated through a combination of higher-level modeling and theory, such as molecular modeling combined with kinetic phenomena (in the case of polymer blends with desired properties), while others may be implied through QSAR types of investigations. Typically, highly sophisticated hybrid approaches that make use of a variety of modeling techniques need to be employed to model the high-level properties to desired levels of prediction accuracy (Ghosh et al., 2000).

Having the necessary property models available brings us to the next topic: the actual CAMD algorithm.

1.1.4 CAMD algorithms

The CAMD algorithm basically solves the CAMD problem formulations of type i-v and other variations of the generic problem defined by Eqs. 1-7. The main solution step involves finding the molecules of the desired type having the desired properties. Here, a distinction is made between those problems that involve only selection (type i and some variations of type ii) and those that involve selection plus design (types ii-v). If the problem is of the selection type (i.e. finding candidates from a database of known compounds), the solution step involves one or more database lookup operations in order to identify the subset (if any) satisfying the property and molecule type constraints. For selection based on pure component properties, the search engine is commonly known as pattern matching (Nielsen et al., 1991), that is, finding the specified pattern in a database. If mixture properties are also considered, the search is more difficult. Cabezas (2000) has developed tools for efficiently solving these problems.
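A selection-type problem can be sketched as a filter over a small table of known compounds. The code below is a plain filter, not the pattern-matching engine cited above; the record layout, the `select` function and the type labels are assumptions made for this example, and the boiling points are approximate values included for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

# Approximate normal boiling points in K, for illustration only.
DATABASE = [
    {"name": "water", "type": "water", "Tb": 373.15},
    {"name": "ethanol", "type": "alcohol", "Tb": 351.4},
    {"name": "acetone", "type": "ketone", "Tb": 329.2},
    {"name": "toluene", "type": "aromatic", "Tb": 383.8},
    {"name": "n-hexane", "type": "alkane", "Tb": 341.9},
]


def select(database: List[dict],
           bounds: Dict[str, Tuple[float, float]],
           excluded_types: Iterable[str] = ()) -> List[str]:
    """Return names of compounds meeting the type and property constraints."""
    excluded = set(excluded_types)
    hits = []
    for record in database:
        if record["type"] in excluded:
            continue  # molecule type constraint
        if all(low <= record[prop] <= high for prop, (low, high) in bounds.items()):
            hits.append(record["name"])  # all property constraints satisfied
    return hits


matches = select(DATABASE, {"Tb": (340.0, 360.0)}, excluded_types={"aromatic"})
print(matches)
```

The subset returned may be empty, which corresponds to the "(if any)" case in the text: no compound in the database satisfies the constraints and a design-type formulation is needed instead.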

If the CAMD problem formulation is of type ii-v, an algorithm is needed to identify (design) the molecules of the specified types that have the desired properties as specified through the property constraints. Even though different algorithms have been proposed for the design of molecules, nearly all of them rely, to some degree, on the creation of chemically feasible molecules from fragments. The most widely used feasibility criterion is the valency rule proposed by Macchietto et al. (1990), where the goal is to guarantee the fulfillment of the octet rule.
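One count-based form of such a valency check follows from a simple graph argument: a connected structure of n groups with r rings has n - 1 + r bonds, so if every free attachment is used, the sum of (2 - v) over all groups must equal 2(1 - r), where v is the number of free attachments of a group. The sketch below implements only this necessary condition. Presenting it as the rule of the cited reference is an assumption; the free-attachment table and the function name are chosen for illustration.

```python
from typing import Dict

# Number of free attachments per group (illustrative subset).
FREE_ATTACHMENTS = {"CH3": 1, "CH2": 2, "CH": 3, "C": 4, "OH": 1, "CH2COO": 2}


def satisfies_valency_rule(group_counts: Dict[str, int], n_rings: int = 0) -> bool:
    """Check sum(n_j * (2 - v_j)) == 2 * (1 - n_rings) for a group vector."""
    total = sum(n * (2 - FREE_ATTACHMENTS[g]) for g, n in group_counts.items())
    return total == 2 * (1 - n_rings)


print(satisfies_valency_rule({"CH3": 2, "CH2": 1, "CH2COO": 1}))  # acyclic ester vector
print(satisfies_valency_rule({"CH3": 3, "CH2": 1}))               # one attachment left over
print(satisfies_valency_rule({"CH2": 6}, n_rings=1))              # single six-membered ring
```

Passing this test does not by itself guarantee a chemically sensible molecule; it only confirms that the groups can be joined with no free attachment remaining.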

Different approaches have been proposed for solving CAMD problems and these approaches can be grouped into three categories:

1. Mathematical programming (a mathematical representation of the problem is solved with a numerical optimization method) - problem type iii-v. Chapters 3, 4 and 11 describe solution approaches of this type.

2. Stochastic optimization (a mathematical representation of the problem is solved by numerical stochastic methods) - problem type ii-iii. Chapter 5 describes a genetic algorithm based solution approach of this type.

3. Enumeration techniques (a combined mathematical and qualitative representation of the problem is solved by hybrid solution approaches) - problem type ii-v, but using a decomposed problem formulation (also called hybrid methods). Chapters 2, 6 and 7 describe solution approaches of this type.

Common to all the solution approaches is that the objective is to find a compound or compounds fulfilling the requirements set forth in the constraints and goals.

1.1.5 Molecular Structure Representation

All CAMD methodologies need to employ some form of representation of the molecular structure information for use in property estimation. In general, the estimation methods used for predicting properties of the designed molecule(s) decide the level of detail needed for the molecular structural information and the representation method to use. Other considerations are compatibility with external programs and databases.

The simplest representation of a compound is an atomic representation based on the chemical formula. Here, a compound is simply represented by the types of atoms it contains and the number of occurrences of each atom type (Fig. 3a). A single representation can describe a large number of compounds of very different types. No direct information regarding the bonds in the compound can be extracted from the representation. However, if assumptions about the valency of the different atom types are made, it is possible to calculate bond configurations. A related form is the representation of a compound as a collection (or vector) of groups. A group is a molecular fragment or substructure defined by the number and types of atoms in the fragment, how the atoms are connected, how many free connections the group has and where (on which atom) they are located. Figure 3b shows an example of a fragment and Fig. 3c an example of a group vector. A group vector contains some information about the connectivity of the structure of the molecule but does not define it completely. As a result, a group vector can represent more than one possible molecule (isomers). Figure 3d illustrates the different compounds that can be constructed using the group vector in Fig. 3c.

The compounds depicted in Fig. 3d have the connectivity defined. One of the most versatile and manageable methods for representing connectivity is the adjacency matrix. An adjacency matrix is a square symmetric matrix whose rows and columns represent the atoms (or fragments) in the molecule and whose entries are non-zero where two atoms (or fragments) are bonded and zero where they are not. An adjacency matrix can be on fragment level or on atomic level. Conversion from a fragment-based matrix to an atom-based matrix is achieved by substituting the entry for each fragment with the atomic adjacency matrix representing the fragment. Figures 3e and 3f are the fragment-based and atom-based adjacency matrices, respectively, for the first compound in Fig. 3d.
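The fragment-level adjacency matrix can be handled directly as a nested list. The sketch below uses the 4 x 4 matrix read from Fig. 3e (rows and columns ordered CH3, CH3, CH2, CH2COO) and checks three things: that the matrix is symmetric, that each fragment uses exactly its number of free attachments, and that all fragments belong to one connected structure. The helper names and the free-attachment table are assumptions made for this example.

```python
from typing import List

fragments = ["CH3", "CH3", "CH2", "CH2COO"]
adjacency = [
    [0, 0, 0, 1],  # CH3    bonded to CH2COO
    [0, 0, 1, 0],  # CH3    bonded to CH2
    [0, 1, 0, 1],  # CH2    bonded to CH3 and CH2COO
    [1, 0, 1, 0],  # CH2COO bonded to CH3 and CH2
]
free_attachments = {"CH3": 1, "CH2": 2, "CH2COO": 2}


def is_symmetric(matrix: List[List[int]]) -> bool:
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def bonds_per_fragment(matrix: List[List[int]]) -> List[int]:
    return [sum(row) for row in matrix]


def is_connected(matrix: List[List[int]]) -> bool:
    """Depth-first search from fragment 0; True if every fragment is reached."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j, bonded in enumerate(matrix[i]):
            if bonded and j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(matrix)


saturated = bonds_per_fragment(adjacency) == [free_attachments[f] for f in fragments]
print(is_symmetric(adjacency), saturated, is_connected(adjacency))
```

At fragment level the matrix records which fragments are bonded but not which atom of a fragment carries the bond; that detail only appears after expansion to the atom-based matrix.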

While the adjacency matrix defines the 2-dimensional relations between atoms in a compound, it does not contain the steric information needed to distinguish R/S, L/D and cis/trans isomers. To distinguish between such isomers, it is necessary to have 3-dimensional information about the placement of the atoms. For 3-dimensional representation, two methods are widely used. The first is the combination of an adjacency matrix with a list of x, y, z Cartesian coordinates for the atoms. The second is the so-called internal coordinate system, where an atom's position is defined by a length, a bond angle and a torsion angle (Maranas and Floudas, 1994). The choice of representation depends on the computations that are to be performed with the 3-dimensional representation.

Chapter 2 describes methods for generating molecular structures using group information only. Chapters 3 and 4 give examples of how the generation of molecular structures can be incorporated into mathematical programming formulations through the feasibility rules. Chapter 4 also gives a detailed description of the generation of molecular structures from higher-level groups (Marrero and Gani, 2001). Chapter 5 describes how molecular structures can be generated through genetic algorithms employing groups and topological indices. Finally, Chapter 7 describes group-based combination rules to generate molecules that also satisfy reaction stoichiometry.

[Figure 3 (diagram), panels: (a) atomic representation, C5H10O2; (b) a fragment (group), CH2COO; (c) group vector: 2 CH3, 1 CH2, 1 CH2COO; (d) the compounds that can be constructed from the group vector in (c); (e) fragment-based adjacency matrix for the first compound in (d):

          CH3   CH3   CH2   CH2COO
CH3        0     0     0      1
CH3        0     0     1      0
CH2        0     1     0      1
CH2COO     1     0     1      0

(f) atom-based adjacency matrix for the same compound (entries not recoverable from the extracted text).]

Figure 3: Different levels of molecular structure representation (Harper, 2000)


1.2 KEY ISSUES & THEIR RELATIONSHIPS

Some of the key issues, and their relationships, associated with the generation of molecular structures and the prediction of the properties of the generated compounds are highlighted here (from Harper 2000).

• Computational Load - This is related to the amount of calculation required to solve any CAMD problem.

• Generation Level - This is related to the steps employed to generate molecular structures (compounds). With increasing levels of molecular structural information, the degree of detail and information also increases.

• Property Range - The Property Range is the total number of properties to be calculated for a generated molecule in order to evaluate whether it matches the specified requirements. Each of the properties in the Property Range may have an associated constraint value indicating a lower and/or upper bound that must be fulfilled if the generated molecule is to be retained for further screening.

• Property Level - This is related to the level of "complexity" involved in the estimation of a needed property. This is a theoretical measure of the amount of information needed in order to calculate the property, based on:

o The type of molecular information needed in order to use the selected property estimation method.

o Whether or not the property requires other properties in order to be calculated (that is, whether it is a secondary property).

o The complexity of the calculation, that is, whether the calculation is iterative, involves the solution of a system of equations or is otherwise calculation intensive.

o If a property p depends on other properties, the level (with respect to calculation order) of property p must be higher than the levels of the other properties. Therefore, if the level of property p is determined on the basis of the levels of other properties, it is not a fixed value for all calculations involving property p, but a variable.

o Whether the property p is a dynamic, i.e. time-evolving, property. Certain high-level, complex performance measures may involve not only the value p(0) of the property at the start of the material's active service life, but also the profile p(t) of its evolution with time over the service period.

• Property Trust - The level of "confidence" one can assign to a property. This depends on:

o Estimation accuracy.

o The dependence on other calculated properties, for example, error propagation.

o Applicability of the method(s) to the compound(s) in question.

For any CAMD problem it is necessary to identify the Generation Levels needed. The entire property range (of the target properties) must be covered within the generation levels. The number of levels needed is determined by the available property estimation methods. Consequently, the property range and the available property estimation methods control the minimum generation level.
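The dependency rule for the Property Level (a property that depends on other properties must sit at a higher level than any of them) can be read as a recursion over a dependency table. The sketch below is one possible reading, assuming that a property with no dependencies has level 1 and that every other property sits one level above its highest dependency; it ignores the other factors listed above, such as calculation complexity. The dependency table is hypothetical apart from the solubility parameter example taken from Section 1.1.3.

```python
from typing import Dict, List, Optional, Set


def property_level(name: str, depends_on: Dict[str, List[str]],
                   _visiting: Optional[Set[str]] = None) -> int:
    """Level 1 for properties without dependencies, else 1 + highest dependency level."""
    visiting = set() if _visiting is None else _visiting
    if name in visiting:
        raise ValueError(f"circular dependency involving {name!r}")
    dependencies = depends_on.get(name, [])
    if not dependencies:
        return 1
    visiting.add(name)
    level = 1 + max(property_level(d, depends_on, visiting) for d in dependencies)
    visiting.remove(name)
    return level


# Hypothetical dependency table; "solvent_loss" and its dependency are illustrative.
depends_on = {
    "solubility_parameter": ["molar_volume", "heat_of_vaporization"],
    "solvent_loss": ["solubility_parameter"],
}

levels = {p: property_level(p, depends_on)
          for p in ["molar_volume", "solubility_parameter", "solvent_loss"]}
print(levels)
```

Changing the estimation method for a property changes its dependency list, and with it the computed level, which matches the remark that the level is a variable rather than a fixed value.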

1.3 TARGETS FOR A CAMD FRAMEWORK

From the above discussion, it is clear that any CAMD methodology requires a number of methods and tools that need to work in an integrated manner. An architecture that glues the various methods and tools together into a CAMD framework could therefore be very useful for further development of CAMD methodologies in a systematic manner, as well as for increasing the solution range of any CAMD methodology. The targets for the development of a CAMD framework could be (Harper 2000):

• The correct formulation of the Property Range is critical to the success of a CAMD method. Failure to identify the important properties will lead to the generation of the wrong products. It is therefore necessary to include a methodology for the formulation of the target property constraints within a CAMD framework.

• The ability to predict a wide range of properties using different methods would broaden the application range of CAMD. Therefore, a CAMD framework must be able to use other prediction methods in addition to the traditionally used GCA methods. This requires the generation and integration of detailed molecular models.

• While the design of highly detailed molecular structures improves the ability to predict properties accurately, there can be a significant associated computational cost. If highly detailed molecules (in terms of structural information) are to be generated, it is necessary that the computational efficiency of the CAMD algorithm be taken into account in the development of the CAMD framework.

• The minimization of uncertainty is important when performing complex calculations. Consequently, the use of correlations should be minimized and the use of experimental data and accurate prediction methods (using all available information) should be maximized.

With the background presented in this chapter, we now move on to some of the tools and methods used to tackle the CAMD problem.


Acknowledgement

The PhD-thesis of Peter M. Harper (2000) has provided material in the form of text and figures for parts of this chapter.

1.4 REFERENCES

1. J. Bicerano, "Prediction of Polymer Properties", Marcel Dekker Inc. (1993).

2. Cabezas, H., "Designing green solvents", Chemical Engineering, 107 (3), March (2000) 107-109.

3. Chem-Bank, Chemical Hazards Response Information System (CHRIS) Database, Silver Platter Information Inc, MA, USA, November (1998a).

4. Chem-Bank, The Hazardous Substances Data Bank (HSDB), Silver Platter Information Inc, MA, USA, November, (1998b).

5. Chem-Bank, The Registry of Toxic Effects of Chemical Substances (RTECS), Silver Platter Information Inc, MA, USA, November (1998c).

6. L. Constantinou, S.E. Prickett and M.L. Mavrovouniotis, "Estimation of thermodynamic and physical properties of acyclic hydrocarbons using the ABC approach and conjugation operators", Ind. Eng. Chem. Res., 32 (1993), 1734.

7. L. Constantinou and R. Gani, "New group contribution method for estimating properties of pure compounds", AIChE J., 40 (1994) 1697.

8. Cussler, E. L., Moggridge, G. D., "Chemical Product Design", Cambridge University Press, USA (2001).

9. Aa. Fredenslund, J. Gmehling, P. Rasmussen, "Vapor liquid equilibria using UNIFAC", Elsevier Scientific, Amsterdam, The Netherlands (1977).

10. Franklin, J. L., "Prediction of Heat and Free Energies of Organic Compounds", Industrial & Engineering Chemistry, 41 (1949) 1070.

11. R. Gani, B. Nielsen and A. Fredenslund, "A group contribution approach to computer-aided molecular design", AIChE J., 37 (1991) 1318.

12. R. Gani, & L. Constantinou, "Molecular Structure Based Estimation of Properties for Process Design", Fluid Phase Equilibria, 116 (1996) 75-86.

13. Ghosh, P., A. Sundaram, V. Venkatasubramanian and J. Caruthers, "Integrated Product Engineering: A Hybrid Evolutionary Framework", Computers and Chemical Engineering, 24 (2000) 685- 691.

14. P. M. Harper, "A Multi-Phase, Multi-Level Framework for Computer Aided Molecular Design", PhD-thesis, Technical University of Denmark, Lyngby, Denmark (2000).

15. S. O. Jonsdottir, Kj. Rasmussen, Aa. Fredenslund, Fluid Phase Equilibria, 100 (1994) 121-138.

16. J. W. Kang, J. Abildskov, R. Gani, J. Cobas, "Estimation of Mixture Properties from First- and Second-Order Group Contributions with the UNIFAC Model", I&EC Research, 41 (2002) 3260-3273.

17. L. Kier, L. H. Hall, "Molecular Connectivity in Structure-Activity Analysis", Wiley, New York, USA (1986).

18. D. Livingstone, "Data analysis for chemists: Application to QSAR and chemical product design", Oxford University Press, Oxford, UK (1995).

19. L. J. Lyman, W. F. Reehl, D. H. Rosenblatt, "Handbook of Chemical Property Estimation Methods, Environmental Behavior of Organic Compounds", American Chemical Society, Washington DC, USA (1990).

20. C. D. Maranas, C. A. Floudas, "A Deterministic Global Optimization Approach for Molecular Structure Determination", J. Chem. Phys., 100 (1994) 1247-1261.

21. J. Marrero and R. Gani, "Group-contribution based estimation of pure component properties", Fluid Phase Equilibria, 183-184 (2001) 183.

22. S. Macchietto, O. Odele and O. Omatsone, "Design of optimal solvents for liquid-liquid extraction and gas absorption processes", Chem. Eng. Res. Des., 68 (1990) 429.

23. J. M. Nielsen, R. Gani, J. P. O'Connell, "TMS: A Knowledge Based Expert System for Thermodynamic Model Selection and Application", in "Computer-Oriented Process Engineering", ed. L. Puigjaner and A. Espuna, Elsevier, 10 (1991) 29-34.

24. B.E. Poling, J.M. Prausnitz, J.P. O'Connell, "The properties of gases and liquids", 5th edition, McGraw-Hill, New York, USA (2000).

25. H. Renon, J. M. Prausnitz, AIChE J., 14 (1968) 135.

26. G. M. Wilson, J. Am. Chem. Soc., 86 (1964) 127.

27. T. D. Martin, D. M. Young, "Prediction of the Acute Toxicity (96-h LC50) of Organic Compounds to the Fathead Minnow Using a Group Contribution Method", Chem. Res. Toxicol., 14 (2001) 1378-1385.

Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.

Chapter 2: Molecular Design - Generation & Test Methods

E.A. Brignole & M. Cismondi

2.1 INTRODUCTION

Traditionally, the search for solvents or products for specific applications has been carried out by examining several compounds and families of compounds and selecting those with the desired properties. A more systematic approach to the solution of these problems is based on CAMD of solvents or products. In both cases an experimental validation of the component properties is recommended. The CAMD approach was introduced in the early eighties for the selection of solvents for separation processes [1,2]. At that time the problem was formulated as follows: "Given a mixture and certain separation goals, synthesize, from the set of UNIFAC groups, molecular structures with the desired solvent properties. The groups are the building blocks for the synthesis process and the UNIFAC thermodynamic model is used for the evaluation of the primary solvent properties". UNIFAC is a group contribution based model [3] used for predicting the liquid phase activity coefficients of the compounds present in the mixture, and the UNIFAC groups are the functional groups needed to represent the molecular structures of the compounds. These two stages, synthesis and evaluation, are still the main components of the various types of CAMD techniques that have been developed.

The extensive development of group contribution methods for the prediction of pure component and mixture properties has been a fertile ground for the generalized use of product molecular design techniques. The original CAMD approach can be defined as the backward product design problem: "given a set of property constraints and certain performance indexes, generate chemical structures with the desired physico-chemical and/or environmental properties". Applications have been reported for the design of polymers [4], refrigerants [5,6], product substitution [7], solvents [8,9,10] and many more.

The first solvent design studies were based on solution properties derived from the UNIFAC group contribution method for computing activity coefficients [3]. Several revisions of the original UNIFAC predictive package, and extensions of it to electrolytes, polymers and equations of state, have been presented [11]; a group contribution equation of state (GC-EOS), based on similar but more detailed group definitions, has been extended to new groups and gases [12-14]. For the prediction of pure component properties, such as heat capacities, solubility parameters, formation energies, critical properties, etc., different group definitions have been proposed [15]. However, correlation of pure component properties has also been proposed in terms of the original UNIFAC groups [16,17], which are also called first-order groups [17, 18]. In this chapter the original UNIFAC group definitions will be used throughout.

This chapter presents the class of CAMD methods that is characterized as generate & test methods. At the macroscopic properties level, methods of this type were first developed for solvent selection and design. For the design of large complex molecules involving a higher level of molecular structural representation than functional groups, most of the procedures also employ generate and test types of CAMD methods. In this chapter, however, only the method based on groups as building blocks is discussed in detail.

2.2 THE EVOLUTION OF CAMD

The elements of a CAMD technique can be divided into algorithmic stages dealing with generation of molecules and testing of generated molecules, that is, i) the "generate" or molecular synthesis stage and ii) the "test" or molecular evaluation stage. The main features of the molecular synthesis stage are: group selection, group characterization and molecular feasibility rules. The result of the molecular synthesis stage is a number of feasible molecular structures. The main features of the molecular evaluation stage are: group contribution methods for property estimation, calculated properties, property constraints and evaluation (performance indexes). The final result is a ranked set of product candidates.

2.2.1 Molecular Synthesis

Molecules are synthesized by joining groups with free-attachments until no free-attachments remain in the generated structure. This means that the search (or design) for suitable molecules is not limited to a given set of molecules. Although this is an attractive feature of CAMD, it also has its drawback: the number of structures that may be generated can be very large. Another important feature with respect to properties prediction (forward problem) and CAMD (reverse problem) is that while in the forward problem the groups representing a molecule are given, in the reverse problem the group's free-attachment properties are also important [1,2] and need to be analysed. The free-attachments of a group are the number of chemical bonds available to neighbouring groups for attachment (or combination).

The characterisation of the group's combination properties is needed mainly to satisfy two criteria:

i) To obtain chemically feasible structures.

ii) To avoid proximity effects that could lead to unreliable UNIFAC predictions.

Therefore, the generation of feasible molecular structures from the groups is subject to several restrictions and is based on the free-attachments of the groups. Some of the restrictions are the result of the way the groups in the UNIFAC table are defined, while other restrictions are made to prevent the formation of unstable compounds or the generation of new functional groups such as acetals (for which the property predictions will be uncertain). In an earlier publication on molecular design using UNIFAC groups [1], a set of combination rules was formulated:

a) Groups with two attachments cannot be combined to obtain a double bond.

b) Aromatic groups with two attachments (such as "ACCH2", see Table 3) must always have one attachment to the aromatic ring.

c) All non-hydrocarbon groups can only combine with a carbon attachment.

d) Only one bond of the carbon atom can be used for attachments with bonds other than those of carbon or hydrogen atoms.

In later works [2,8] a more detailed group characterisation was introduced, allowing a more general formulation of feasibility rules for aliphatic and aromatic compounds. The main chemical property used for the generation of combination rules was the electronegativity of the group bonds [2,8,9]. Other authors have proposed feasibility rules that satisfy the molecule neutrality conditions. However, the chemical stability of the components is, in many cases, not guaranteed [5,6] with such feasibility rules. This is partly due to the way groups are defined in different group contribution methods and/or the lack of proper combination rules for the groups.

Classification of Groups

The UNIFAC groups with free-attachments (or bonds) have one or more attachments for combination among themselves. Groups with only one free attachment are defined as "terminal" groups. All other groups with more than one free attachment are defined as "intermediate" groups. There are three types of intermediate groups (i.e., groups with multiple attachments): radial, linear and mixed. In the groups of the UNIFAC parameter tables, there are no more than two atoms with "free" attachments. The "free" attachments of a group may be characterised by two properties: i) attachment status, which takes into account the combination properties and ii) valence, the number of attachments. Four types of attachments for paraffinic groups have been defined on the basis of their electronegativity:

• K: severely restricted attachment, e.g., "-OH", "CH3O-"

• L: partially restricted attachment, e.g., "-CH2Cl"

• M: unrestricted carbon attachment in single valence or linear dual valence groups

• J: unrestricted carbon attachments in radial paraffinic groups, e.g., "-CH2-", "-CH<"

Three basic types of group valences have been identified in aliphatic compounds: M and J, which are classified as neutral, and L and K, with increasing degrees of electronegativity. The methyl group (CH3), even though it is a J type group, is identified as type M because it plays a different role in the feasibility criteria analysis with respect to the other J groups.

The synthesis of aromatic compounds requires the introduction of additional attachments:

• I: aromatic carbon ring attachment, such as ACH

• H: substituted aromatic carbon ring attachment, such as ACCl

Types M and J attachments are extended to aromatic groups as follows:

• M: unrestricted attachment in a carbon linked to an aromatic carbon, such as ACCH2-

• J: unrestricted attachments in a "radial" carbon linked to an aromatic carbon, such as ACCH<

The valence of an aromatic carbon attachment has been defined to be one. For example, the characterization of the group (ACH) is (I,1), where the letter indicates the attachment type and the number indicates the valence. Modifying its attachment type can change the combination properties of a given group. For instance, to avoid proximity effects between polar groups a type L attachment may be changed to a type K. As an example, the keto group (CH2CO) and the amino group (CH2NH) both have the same combination properties (L,1)(K,1); therefore the combination (NHCH2)-(CH2CO) is feasible, as is a keto-amino compound like (CH3)-(NHCH2)-(CH2CO)-(CH3). However, if proximity effects between the amino and the keto group are to be avoided, the keto group characterization can be modified to (K,2). In this case both attachments of this group are highly restricted, and thus an additional (CH2) is needed to link both functional groups: (CH3)-(NHCH2)-(CH2)-(CH2CO)-(CH3).


Combination & Feasibility Rules

The attachment combination properties for the synthesis of paraffinic, aromatic and cyclic solvents are:

R1: Type K attachments can only be combined with unrestricted carbon attachments.

R2: Type L attachments can be combined with L, M or J attachments.

R3: The combination of a J attachment (radial paraffinic group) with a type K attachment changes the status of the remaining free attachments of the group to L.

R4: Aromatic rings are built only with type I and H attachments.
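Rules R1 and R2 amount to a pairwise compatibility test between attachment types. The following is a minimal sketch of that test for the aliphatic types only; the function name `can_bond` is hypothetical, R3 (the status change of remaining attachments) and R4 (aromatic rings) are deliberately left out, and treating every M/J pairing as allowed is an assumption based on these types being described as unrestricted.

```python
# Unrestricted carbon attachment types named in the text.
UNRESTRICTED = {"M", "J"}


def can_bond(a: str, b: str) -> bool:
    """Return True if attachment types a and b may be joined under R1 and R2."""
    pair = {a, b}
    if "K" in pair:
        # R1: a K attachment needs an unrestricted carbon partner (M or J).
        other = b if a == "K" else a
        return other in UNRESTRICTED
    if "L" in pair:
        # R2: an L attachment may join L, M or J.
        other = b if a == "L" else a
        return other in {"L", "M", "J"}
    # Only M and J remain; assumed to combine freely (not stated as a rule).
    return pair <= UNRESTRICTED


# Keto-amino example from the text: with both groups as (L,1)(K,1) the
# L-L link is allowed; once the keto group is recast as (K,2) it is not.
print(can_bond("L", "L"), can_bond("K", "L"))
```

Under this sketch the direct (NHCH2)-(CH2CO) link is accepted through the L-L pairing, while a K-L pairing is rejected, which is why the recharacterised keto group needs an intervening (CH2).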

Very simple feasibility criteria were formulated when the above combination rules were applied to the synthesis of linear paraffinic or linear mixed paraffinic-aromatic compounds, using only dual valence intermediate groups and single valence terminal groups. For example, the set of groups that makes up a molecular structure should have a number of unrestricted carbon attachments equal to or greater than the number of K (severely restricted) attachments. If this condition is satisfied, no further restrictions are imposed on the combination of the remaining attachments for the case of linear molecular structures. The feasibility criteria for intermediate molecular structures (IMSs) and final molecular structures (FMSs) are given in Table 1a. Examples of application of the feasibility criteria are given in Table 1b.

Table 1a. Feasibility Criteria for IMSs and FMSs

Type of compound       IMSs                        FMSs
Aliphatic              K ≤ M + J/2 + 2             K ≤ M + J/2
Aromatic (a)           I + H = 6                   I + H = 6
Aliphatic-aromatic     K ≤ M + J/2; I + H = 6      K ≤ M + J/2; I + H = 6
Cyclic                 -                           K ≤ M + J/2

(a) Single ring aromatic structures

Generation of Feasible Molecular Structures

The basic technique for the molecular synthesis stage in the Generate & Test methods follows a combinatorial approach: enumerating the possible combinations (in this case FMSs) from the building blocks (in this case groups) and testing each FMS against its structural and property constraints. Brignole et al. [2] proposed a combinatorial-partition strategy where a selected set of molecular groups is combined, considering all possible chemical structures, and the results are then screened by checking the feasibility conditions. In principle the size of the combinatorial problem considering all the UNIFAC groups is of insurmountable magnitude. However, a realistic implementation of many product or solvent design problems can be handled efficiently by a combinatorial molecular generation approach and will be discussed later. Pretel et al. [8] proposed molecular synthesis techniques based on intermediate and terminal groups for the generation of linear (not branched) molecules.

Table 1b. Examples of Feasibility Criteria for IMSs and FMSs

Paraffinic
  Group characterisation: (CH2): (J,2); (CHCl): (L,2); (CH=CH): (K,2); (OH): (K,1); (CH3): (M,1)
  IMS: (CHCl)(CH=CH)(CH2): M=0; J=2; L=2; K=2; M+J/2+2 = 3 (feasible IMS)
  FMS: (CH3)2(CH=CH)(CHCl)(CH2): M=2; J=2; L=2; K=2; M+J/2 = 3 (feasible FMS)
  FMS: (CH3)(CH=CH)(CHCl)(CH2)(OH): M=1; J=2; L=2; K=3; M+J/2 = 2 (unfeasible FMS)

Aromatic (a)
  Group characterisation: (ACCH3): (H,1); (ACH): (I,1); (ACOH): (H,1)
  FMS: (ACH)3(ACCH3)2(ACOH): I=3; H=3; I+H = 6 (feasible molecule)

Aliphatic-aromatic
  Group characterisation: (ACCH2): (H,1)(M,1); (AC): (H,1)(M,1)
  IMS: (AC)(ACH)3(ACCH2)2: M=3; J=0; I=3; H=3; K=0; M+J/2 = 3; I+H = 6 (feasible IMS)
  FMS: (OH)2(AC)(ACH)3(ACCH2)2(CH3O): M=3; J=0; I=3; H=3; K=3; M+J/2 = 3; I+H = 6 (feasible FMS)

Cyclic
  Group characterisation: (CH2CO): (L,1)(K,1)
  FMS: (CH2)3(CH2CO): M=0; J=6; L=1; K=1; M+J/2 = 3 (feasible FMS)

(a) Single ring aromatic structures
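The aliphatic criteria of Table 1a reduce to counting attachments by type. The sketch below reproduces the paraffinic and cyclic rows of Table 1b; the dictionary layout and the function names are illustrative assumptions, and the (type, valence) pairs are those read from the table, with (CH2) taken as (J,2).

```python
from collections import Counter

# (attachment type, valence) pairs per group, as read from Table 1b.
GROUPS = {
    "CH3": [("M", 1)],
    "CH2": [("J", 2)],
    "CHCl": [("L", 2)],
    "CH=CH": [("K", 2)],
    "OH": [("K", 1)],
    "CH2CO": [("L", 1), ("K", 1)],
}


def attachment_counts(composition):
    """Sum the attachments of each type over a {group: count} composition."""
    totals = Counter()
    for group, n in composition.items():
        for kind, valence in GROUPS[group]:
            totals[kind] += n * valence
    return totals


def is_feasible_aliphatic(composition, final=True):
    """Table 1a: K <= M + J/2 for FMSs, K <= M + J/2 + 2 for IMSs."""
    c = attachment_counts(composition)
    allowance = 0 if final else 2
    return c["K"] <= c["M"] + c["J"] / 2 + allowance


fms_ok = {"CH3": 2, "CH=CH": 1, "CHCl": 1, "CH2": 1}
fms_bad = {"CH3": 1, "CH=CH": 1, "CHCl": 1, "CH2": 1, "OH": 1}
print(is_feasible_aliphatic(fms_ok), is_feasible_aliphatic(fms_bad))
```

The two compositions are the paraffinic FMSs of Table 1b: the first has K = 2 against M + J/2 = 3, the second K = 3 against M + J/2 = 2.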

A direct extension of the feasibility criteria of Pretel et al. [8] to branched structures is the following:

\[ \sum_i i\,K_i \le M + J \quad \text{(final structure)}, \qquad \text{where } J = J_2 + J_3 + J_4 \tag{1} \]

where K_i or J_i are the number of K or J groups with "i" attachments in the structure and M is the number of methyl groups. However, this criterion leads, in many cases, to structures not described by UNIFAC groups. For instance, when the above feasibility rule is applied to the final structure FMSa: (HCOO)(CH)(CH3)(OH):

K groups: (OH): (K,1); (HCOO): (K,1); $\sum_i i\,K_i = 1 \cdot 2 = 2$

M group: (CH3): (M,1); M = 1

J3 group: (CH): (J,3); J = 1

By application of equation (1), M + J = 2, which is not less than $\sum_i i\,K_i = 2$, so FMSa is feasible. However, in this structure the tertiary carbon group -CH< is attached to two oxygen bonds, generating a combination of atoms (functional group) not available in the UNIFAC table. A feasible structure can be obtained by the addition of a (CH2) group, leading to FMSb: (HCOO)(CH)(CH2)(CH3)(OH).

New Group Combination Property Characterization

The failure of equation (1) to deal with branched structures can be explained as follows: after the combination of a J group with valence greater than two, there are residual free attachments whose combination properties may be modified when linked to K groups.

Therefore, the formulation of robust feasibility criteria for the synthesis of branched structures requires not only the characterisation of the group free attachments but also of the group internal bonds. This is particularly the case for groups having L attachments. A more detailed characterisation of group combination properties for aliphatic compounds was therefore introduced by Cismondi and Brignole [18]. Considering the internal and free bonds, only two bond statuses, K (electronegative) and J (neutral), are required to characterise the combination properties. For example, groups with L attachments are formed by a combination of two "pure" K and J subgroups (see Table 2). A revised set of combination properties of UNIFAC groups is presented in Table 3. The methyl group (CH3) is still characterised as a neutral M group and it is not counted as a J group. With the new group characterisation more general feasibility criteria can be implemented.

Table 2: Redefinition of group combination properties in terms of J and K bond status

UNIFAC group    Valence    Previous            Decomposed in    New group combination
(example)                  characterisation    sub-groups       properties
(CH2Cl)         1          (L,1)               J2 + K1          (K,1)(J,2)
(CHCl)          2          (L,2)               J3 + K1          (K,1)(J,3)
(CCl)           3          (L,3)               J4 + K1          (K,1)(J,4)
(CH2CO)         2          (K,1)(L,1)          J2 + K2          (K,2)(J,2)
(CHNH)          3          (K,1)(L,2)          J3 + K2          (K,2)(J,3)
(CH2N)          3          (K,2)(L,1)          J2 + K3          (K,3)(J,2)


Feasibility Criteria for the Synthesis of Linear or Branched Structures

Considering the molecular structure as a combination of pure "K" and "J" groups or subgroups, the new synthesis concept is: each pure J group cannot be attached to more than one K group. In other words, the building of feasible molecules requires the existence of a J-J type bond for each K group incorporated into the molecule (after the first one, for non-cyclic structures). For example, consider the following sequence of feasible final structures, where only the J-J bonds introduced for feasibility reasons are shown:

(CH3CO)(CH3) --> (CH3CO)(CH2)-(CH2)(COCH3) --> (CH3CO)(CH2)-(CH)-(CH2)(COCH3), with (OH) attached to the (CH) group

The last is a branched structure with a tertiary carbon linked to an (OH) group. This example shows how the addition of each K group requires the introduction of J-J bonds in the final structures.

This synthesis concept can be formulated as follows:

\[ K \le \mathrm{NJJ} \quad \text{(cyclic)} \tag{2} \]

\[ K - 1 \le \mathrm{NJJ} \quad \text{(non-cyclic)} \tag{3} \]

where NJJ is the number of J-J bonds.

These conditions are valid for both intermediate and final structures. Therefore the new feasibility criteria consist of determining NJJ by counting the number of type J attachments available. A "J attachments balance" can be obtained as follows:

\[ \sum_i i\,J_i = 2\,\mathrm{NJJ} + \mathrm{NJF} \quad \text{when } K \le \mathrm{NJF} \tag{4} \]

or

\[ \sum_i i\,J_i = 2\,\mathrm{NJJ} + \mathrm{NJF} + 2\,(K - \mathrm{NJF}) \quad \text{when } K > \mathrm{NJF} \tag{5} \]

where the number of J free attachments is given by:

\[ \mathrm{NJF} = J_3 + 2 J_4 + 2 \quad (\text{non-cyclic and } J \ge 1) \tag{6} \]

or

\[ \mathrm{NJF} = J_3 + 2 J_4 \quad \text{(cyclic)} \tag{7} \]

In the final structure (non-cyclic) of the previous example, (CH3CO)(CH2)-(CH)-(CH2)(COCH3) with (OH) attached to the (CH) group:

J2 = 2; J3 = 1; NJF = 3; NJJ = 2; K = 3; $\sum_i i\,J_i = 7$; J = 3

Therefore the structure verifies the feasibility criterion given by equation (3). However, if this criterion is applied to FMSa: (HCOO)(CH)(CH3)(OH), discussed in the previous section:

J3 = 1; NJF = 3; NJJ = 0; K = 2; $\sum_i i\,J_i = 3$; J = 1

The structure is unfeasible because it does not satisfy equation (3). When K > NJF, a (K-NJF) number of K groups must be inserted in the intermediate structure, requiring twice as many additional J bonds (equation 5) to obtain a feasible structure. For example, the following final structure is unfeasible:

(CH3CO)(CH2CO)(CH2)-(CH)-(CH2)(COCH3), with (OH) attached to the (CH) group

J3 = 1; NJF = 3; NJJ = 2; K = 4; $\sum_i i\,J_i = 9$; J = 4
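The three worked examples can be checked mechanically from equations (2) to (7). The sketch below is one possible reading: the function name and argument layout are hypothetical, the boundary case K = NJF is assigned to equation (4) because that is what the first example requires, and the J = 0 branch follows the remark in the text that a lone K group can only be closed by a methyl group.

```python
def j_bond_feasible(j2, j3, j4, k, cyclic=False):
    """Check equations (2)-(3) using the J attachment balance (4)-(7).

    j2, j3, j4: number of J groups or subgroups of valence 2, 3 and 4.
    k: number of K groups or subgroups.
    """
    if j2 + j3 + j4 == 0:
        # No J groups: at most one K group, closed by a methyl group.
        return k <= 1
    njf = j3 + 2 * j4 + (0 if cyclic else 2)          # equations (6), (7)
    j_attachments = 2 * j2 + 3 * j3 + 4 * j4          # sum over i of i * J_i
    if k <= njf:
        njj = (j_attachments - njf) / 2               # equation (4)
    else:
        njj = (j_attachments - njf - 2 * (k - njf)) / 2   # equation (5)
    required = k if cyclic else k - 1                 # equations (2), (3)
    return required <= njj


print(j_bond_feasible(2, 1, 0, 3))   # (CH3CO)(CH2)-(CH)(OH)-(CH2)(COCH3)
print(j_bond_feasible(0, 1, 0, 2))   # FMSa
print(j_bond_feasible(3, 1, 0, 4))   # structure with the extra (CH2CO)
print(j_bond_feasible(4, 1, 0, 4))   # same, with one additional (CH2)
```

The last call adds one (CH2) to the unfeasible structure, which raises NJJ from 2 to 3 and satisfies equation (3).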

On the basis of the previous definitions (equation 1) and equations 2 to 7, the general feasibility criteria derived for linear or branched structures are shown in Table 3, where J is the number of J subgroups given by equation 1. From Table 3 it can be seen that for the case where K > NJF an additional (CH2) is required in the previous example in order to obtain a feasible molecule. When NJF = 0 then J = 0; in this case, for K = 1, the final molecule is obtained only by combining the K group with an M group (CH3). This is the case, for example, of methanol (CH3)(OH), where M = 1; J = 0; K = 1.

In the application of the feasibility criteria of Table 3, K and J are the total number of groups or subgroups of each kind that participate in the molecule, irrespective of their valence. The criteria for the aromatic parts of the structures are those indicated in Table 1 and should be combined with the ones of Table 3 in the synthesis of mixed (aromatic-paraffinic) structures. Considering that the new group characterisation gives more detailed properties of the functional group, the feasibility criteria of Table 3 can be extended to different group definitions.

Table 3: Feasibility criteria for linear and cyclic branched structures

                          K ≤ NJF       K > NJF
Non-cyclic structures     K ≤ J         2K ≤ J + NJF
Cyclic structures         K ≤ J         2K ≤ J + NJF
J = 0                     K ≤ 1
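The entries of Table 3 can be recovered by eliminating NJJ between the bond-count conditions (2)-(3) and the J attachment balance (4)-(7). The derivation below is a reconstruction, not taken from the source; it assumes the relations in (4)-(7) are equalities, that the boundary K = NJF belongs to equation (4), and uses $J = J_2 + J_3 + J_4$ so that $\sum_i i\,J_i = 2J_2 + 3J_3 + 4J_4$.

```latex
\begin{align*}
\intertext{Non-cyclic, $K \le \mathrm{NJF}$, with $\mathrm{NJF} = J_3 + 2J_4 + 2$:}
2\,\mathrm{NJJ} &= (2J_2 + 3J_3 + 4J_4) - (J_3 + 2J_4 + 2) = 2J - 2
  \;\Rightarrow\; \mathrm{NJJ} = J - 1, \\
K - 1 \le \mathrm{NJJ} &\;\Longleftrightarrow\; K \le J. \\
\intertext{Non-cyclic, $K > \mathrm{NJF}$:}
2\,\mathrm{NJJ} &= 2J - 2 - 2\,(K - \mathrm{NJF})
  \;\Rightarrow\; \mathrm{NJJ} = J - 1 - K + \mathrm{NJF}, \\
K - 1 \le \mathrm{NJJ} &\;\Longleftrightarrow\; 2K \le J + \mathrm{NJF}. \\
\intertext{Cyclic, with $\mathrm{NJF} = J_3 + 2J_4$:}
K \le \mathrm{NJF}: \quad \mathrm{NJJ} &= J
  \;\Rightarrow\; K \le J, \\
K > \mathrm{NJF}: \quad \mathrm{NJJ} &= J - K + \mathrm{NJF}
  \;\Rightarrow\; 2K \le J + \mathrm{NJF}.
\end{align*}
```

Both structure types therefore share the same closed-form criteria and differ only through the value of NJF.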


Table 4. Revision of the Combination Properties of UNIFAC Groups

Combination properties: groups with the same combination properties (metha-groups)

(J,2): (CH2)
(J,3): (CH)
(J,4): (C)
(J,2)(K,1): (CH2Cl), [illegible], (CH2CN), (CH2SH)
(J,3)(K,1): (CHCl), [illegible], (HCON(CH2)2)
(J,4)(K,1): (CCl)
(J,2)(K,2): [illegible], (CH2COO), [illegible], (CONHCH2), (CONCH3CH2), (FCH2O), (CH2S), (C2H4O2)
(J,3)(K,2): (CH-O), (CHNH), (CON(CH2)2), (CHS)
(J,2)(K,3): (CH2N)
(K,1): (CH2=CH), (OH), (CH3CO), (CHO), (CH3COO), (HCOO), (CH3O), [illegible], (C5H4N), (COOH), [illegible], [illegible], (CH2NO2), (I), (Br), (CH≡C), Cl-(C=C), (SiH3), (CCl2F), [illegible], (CCl3), [illegible], (CH3S), (CONH2), (CONHCH3), (CON(CH3)2)
(K,2): (CH=CH), (CH2=C), (CH3N), (C5H3N), (CCl2), (CHNO2), (C≡C), (COO), (SiH2), (SiH2O), (C4H2S)
(K,4): (C=C), (Si), (SiO)
(I,1): (ACH), (ACF)
(H,1): (ACCH3), (ACOH), (ACNH2), (ACCl), (ACNO2)
(K,1)(H,1): (AC)

Reducing the Combinatorial Size of the Problem

The group characterisation given in Table 4 indicates that there are only 19 different combination properties of the UNIFAC groups. Therefore, using the feasibility criteria of Table 3, an efficient combinatorial synthesis of branched molecules is implemented on the basis of metha-groups, i.e. groups with the same combination properties as indicated in the first column of Table 4. In the synthesis of linear molecules the intermediate structures have two free attachments. However, the number of free attachments (NFA) in branched intermediate structures is always larger than two:

\[ \mathrm{NFA} = 2 + \mathrm{NV}_3 + 2\,\mathrm{NV}_4 \quad \text{(non-cyclic)} \tag{10} \]

or

\[ \mathrm{NFA} = \mathrm{NV}_3 + 2\,\mathrm{NV}_4 \quad \text{(cyclic)} \tag{11} \]

where NV3 and NV4 are the number of groups of valence three and four.
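Equations (10) and (11) agree with a simple bond count: joining n intermediate groups into one acyclic skeleton uses n - 1 bonds, a single ring uses n bonds, and each bond consumes two attachments. The sketch below compares the two counts; the function names are hypothetical and only groups of valence two, three and four are assumed.

```python
def free_attachments(valences, cyclic=False):
    """Attachments left over after joining the groups into one skeleton."""
    n = len(valences)
    bonds = n if cyclic else n - 1   # a single ring closes one extra bond
    return sum(valences) - 2 * bonds


def nfa_from_counts(nv3, nv4, cyclic=False):
    """Equations (10) and (11)."""
    return nv3 + 2 * nv4 + (0 if cyclic else 2)


# Two valence-2 groups, one valence-3 group and one valence-4 group.
valences = [2, 2, 3, 4]
print(free_attachments(valences), nfa_from_counts(1, 1))
print(free_attachments(valences, cyclic=True), nfa_from_counts(1, 1, cyclic=True))
```

Groups of valence two do not appear in the formulas because each contributes two attachments and consumes two.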

Computer programs based on the above combination rules and group classification can easily be developed [18] and consist of the following steps:

1. Definition by the user of the desired product or solvent property constraints and performance index.

2. Selection of the intermediate and terminal groups in an interactive way.

3. Generation of metha- Intermediate Molecular Structures with NFAs from 2 to 8, using the available metha-groups (intermediate) and satisfying the feasibility criteria. A maximum number of 12 groups in the Final Molecular Structures (FMS) is allowed. Then, each metha-IMS is replaced by all different possible combinations of the selected groups to form "real" IMSs.

4. In a similar way, pre-FMSs are obtained by adding (NFA-2) terminal groups to each IMS.

5. Screening of the pre-FMSs according to the physical property constraints.

6. Termination of Solvent Molecular Structures (SMSs) by adding to each accepted IMS different combinations of two terminal groups that conserve the molecule feasibility.

7. Screening of the synthesized SMSs according to the physical property constraints.

8. Ranking the selected products in accordance with molecular complexity and specific performance index, indicating their predicted physico-chemical or environmental properties.
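The generate and test loop in the steps above can be illustrated on a deliberately small case. The sketch below is not the program of reference [18]: it enumerates only linear compositions (multisets of dual valence intermediate groups closed by two terminal groups), applies the linear FMS criterion K ≤ M + J/2, and uses a molecular weight bound as the single property screen. The group subset, the helper names and the default limits are assumptions; the molecular weights are computed from standard atomic masses.

```python
from itertools import combinations_with_replacement

# name: (attachments by type, molecular weight in g/mol)
INTERMEDIATE = {
    "CH2": ({"J": 2}, 14.027),
    "CH2CO": ({"L": 1, "K": 1}, 42.037),
}
TERMINAL = {
    "CH3": ({"M": 1}, 15.035),
    "OH": ({"K": 1}, 17.007),
}
ALL = {**INTERMEDIATE, **TERMINAL}


def count(groups, kind):
    """Total number of attachments of one type in a composition."""
    return sum(ALL[g][0].get(kind, 0) for g in groups)


def feasible(groups):
    """Linear FMS criterion: K <= M + J/2."""
    return count(groups, "K") <= count(groups, "M") + count(groups, "J") / 2


def generate(max_intermediate=2, max_mw=240.0):
    """Generate linear compositions, then keep those passing both tests."""
    accepted = []
    for n in range(max_intermediate + 1):
        for middle in combinations_with_replacement(sorted(INTERMEDIATE), n):
            for ends in combinations_with_replacement(sorted(TERMINAL), 2):
                groups = middle + ends
                mw = sum(ALL[g][1] for g in groups)
                if feasible(groups) and mw < max_mw:
                    accepted.append((groups, round(mw, 3)))
    return accepted


for groups, mw in generate():
    print(groups, mw)
```

Each result is a composition rather than a specific connectivity, so isomers are not distinguished; a fuller implementation would work through metha-groups, branched IMSs and the activity-coefficient-based screens described in the text.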

The size of the combinatorial synthesis problem increases when considering branched structures because of the large number of free attachments of the intermediate structures (equations 10 and 11) and the larger number of groups available (see Table 4). Usually, UNIFAC or other group contribution methods for computation of activity coefficients or component fugacities are used in the case of solvent design. The application of these methods requires the availability of binary parameters between the groups participating in the molecule synthesis stage. Therefore, between steps 3-4, 4-5 and 6-7 the molecular synthesis method eliminates all intermediate and final structures that contain pairs of groups (one or more) with unknown binary interaction parameters, limiting in this way the size of the combinatorial problem and reducing the computing time. The results of the synthesis procedure are illustrated with an example of solvent design for the separation of benzene from hexane by liquid extraction. For this example the following groups were chosen:

(C), (CH-O), (CHNH), (CH), (CH3N), (CHNO2), (COO), (CH2CO), (CH2NH), (DMF-2), (CH2), (OH), (CH3COO), (HCOO), (CH2NH2), (CH3).
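The elimination of structures with missing binary interaction parameters is a pairwise lookup. A minimal sketch follows, assuming a hypothetical set `known_pairs` of unordered group pairs for which parameters exist; a real UNIFAC table stores parameters per main group, so subgroups would first have to be mapped to their main groups.

```python
from itertools import combinations


def has_all_binary_parameters(groups, known_pairs):
    """True if every pair of distinct groups in the structure is tabulated."""
    distinct = sorted(set(groups))
    return all(frozenset(pair) in known_pairs
               for pair in combinations(distinct, 2))


# Hypothetical availability table, for illustration only.
known_pairs = {frozenset({"CH2", "OH"}), frozenset({"CH2", "CH3"}),
               frozenset({"CH3", "OH"})}

print(has_all_binary_parameters(("CH3", "CH2", "CH2", "OH"), known_pairs))
print(has_all_binary_parameters(("CH3", "CH2", "CH2CO", "OH"), known_pairs))
```

Applying this test to intermediate structures, before terminal groups are added, discards every final structure that would inherit the missing pair.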

The only physical property constraint for an intermediate structure is a maximum solvent loss of 10%. For the final solvents the main physical constraints are a selectivity greater than five and a molecular weight less than 240. In this example, 16 groups (10 intermediate and 6 terminal groups) are selected for the synthesis of solvents with a minimum of two groups and a maximum of 12 in the final structure. In this case 10 metha-groups can be identified within the selected set of groups. An example of the number of structures that are generated in the different steps of the molecular synthesis process is given in Table 5. The direct combination of these groups to form structures of 2 to 12 groups results in the generation of 646635 structures. The results of Table 5 show that the use of feasibility rules and physical constraints, together with the lack of binary interaction coefficients between groups, leads to a significant reduction in the size of the synthesis problem.

However, when pure component properties are dominant in the product design, the size of the combinatorial problem is not limited by the availability of binary parameters. A sound strategy to handle this problem is to make a preliminary search of product candidates using only single and dual valence groups. Thereafter, it is convenient to select the main group families that lead to the most promising branched structures. Note that in this case a database search method may also be employed, provided that a large database is available.

2.2.2 Test or Molecule Evaluation Stage

The test stage of generate & test methods is closely related to the type of product design problem being solved. In this chapter, only solvent-based separation problems are considered. A separation operation requires specific values or ranges of solvent properties for each particular application. These properties determine the physical property constraints that limit the search space of solvent structures. The solvent property constraints may have lower or upper bounds or both. Even though it is difficult to define the conditions for an optimum solvent, the solvents synthesised by molecular design can be ranked according to a performance index and molecular complexity. The development of molecular design applications to different separation problems therefore requires the identification of these physical constraints and the formulation of predictions based on group contribution methods.


Table 5: Solvent design for separation of benzene from hexane by liquid extraction

Number of groups selected:                                        16
Number of metha-intermediate structures generated:                2344
Number of metha-pre-final solvents:                               10552
Number of pre-final solvents:                                     101934
  - Pre-final solvents rejected by MW restriction:                81303
  - Pre-final solvents rejected by lack of binary parameters:     14475
  - Pre-final solvents rejected by solvent loss constraints:      4120
Number of final solvents generated:                               8823
Number of final solvents that satisfy all physical constraints:   277

Potential solvents                  Selectivity    Distribution coefficient
(CH3)(CH2)2(CH2COO)2(HCOO)          8.8            0.85
(CH3)3(CH2)2(C)(HCOO)3              7.5            0.76

Liquid Extraction

When selecting a solvent for liquid extraction, it is important to consider all the separating operations involved in the liquid extraction process:

i) solvent extractor,

ii) raffinate removal from extract,

iii) solute purification,

iv) solvent recovery column.

The scheme shown in Fig. 1 is typical for the extraction of a dilute component. If the solute is recovered by extraction from a dilute solution, the solute/solvent relative volatility should be much greater than one, and the solvent solubilities in the raffinate should be very low. Otherwise, economic considerations screen out liquid extraction as infeasible for the separation under consideration. Cockrem et al. [20] indicated that the solute distribution coefficient and the solvent solubility in the raffinate (solvent loss) are usually the dominant properties for solvent selection in liquid extraction. Low solvent loss in the raffinate also determines raffinate-extract immiscibility. High solvent selectivity is also required to reduce the cost of solute recovery and purification from the extract. The absence of a solute-solvent azeotrope and a high relative volatility for the solute-solvent pair can be assured by requiring a minimum boiling point difference. In general, the evaluation of potential solvents for liquid extraction is based on primary solvent properties and pure component properties (boiling points, heats of vaporisation, densities and molecular weights). The primary solvent properties (selectivity, distribution coefficient, solvent loss and solvent power) can be obtained from UNIFAC group contribution predictions of infinite dilution activity coefficients. The pure component properties of the solvent structures generated with UNIFAC groups can be estimated by group contribution methods (Pretel et al. [11], Gani and Constantinou [17]). The primary solvent properties can be estimated through the expressions given in Table 6.

Pretel et al. [8] evaluated the performance of the UNIFAC method with respect to its liquid-liquid and liquid-vapour group interaction parameter tables. Their conclusion is that the vapour-liquid parameter table renders more reliable predictions at infinite dilution conditions than the liquid - liquid parameter table. In addition, there are a greater number of groups and parameters available in the liquid-vapour parameter table and its revisions (Gmehling et al., 1982; Macedo et al., 1983; Tiegs et al., 1987; and Hansen et al., 1991), than in the liquid-liquid parameter table (Magnussen et al. 1981).

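Table 6. UNIFAC Evaluation of Primary Solvent Properties for Liquid Extraction

Property (mass basis): estimate

Solvent selectivity: $\beta = \dfrac{\gamma^{\infty}_{B,S}}{\gamma^{\infty}_{A,S}} \, \dfrac{MW_A}{MW_B}$

Solvent power: $S_p = \dfrac{1}{\gamma^{\infty}_{A,S}} \, \dfrac{MW_A}{MW_S}$

Solute distribution coefficient: $m = \dfrac{\gamma^{\infty}_{A,B}}{\gamma^{\infty}_{A,S}} \, \dfrac{MW_B}{MW_S}$

Solvent loss: $S_l = \dfrac{1}{\gamma^{\infty}_{S,B}} \, \dfrac{MW_S}{MW_B}$

Here A is the solute, B the raffinate (carrier) component and S the solvent; $\gamma^{\infty}_{i,j}$ is the infinite dilution activity coefficient of i in j.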
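A small helper makes the mass-basis conversions explicit. The expressions in the sketch below are an assumption: they are the commonly used infinite-dilution forms (A = solute, B = raffinate component, S = solvent), obtained by converting mole-fraction ratios to mass ratios, and should be checked against the original Table 6 before use. The function and argument names are hypothetical, and the numerical inputs are made up for illustration.

```python
def primary_solvent_properties(g_as, g_bs, g_ab, g_sb, mw_a, mw_b, mw_s):
    """Mass-basis primary solvent properties from infinite dilution values.

    g_xy is the infinite dilution activity coefficient of x in y
    (a = solute, b = raffinate component, s = solvent).
    The formulas are assumed forms, not confirmed by the source table.
    """
    return {
        "selectivity": (g_bs / g_as) * (mw_a / mw_b),
        "solvent_power": (1.0 / g_as) * (mw_a / mw_s),
        "distribution_coefficient": (g_ab / g_as) * (mw_b / mw_s),
        "solvent_loss": (1.0 / g_sb) * (mw_s / mw_b),
    }


# Made-up inputs, chosen only to show the call.
props = primary_solvent_properties(
    g_as=2.0, g_bs=20.0, g_ab=4.0, g_sb=100.0,
    mw_a=50.0, mw_b=100.0, mw_s=200.0,
)
for name, value in props.items():
    print(f"{name}: {value:.3f}")
```

In a generate and test loop the activity coefficients would come from a UNIFAC evaluation of each candidate structure, and the returned values would be compared with the property constraints.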

[Figure 1: flow diagram. Labels: Feed A+B; Extractor; Extract A+B+S; Raffinate removal column (B); Solvent; Solvent and solute separation column (Solute A).]

Figure 1. Typical cycle for the extraction of a dilute solute

Extractive Distillation

The standard extractive distillation process works in two steps: the extractive distillation column and the solvent recovery column. The primary solvent properties are the degree to which the solvent increases the relative volatility between the mixture components, the normal boiling point difference between the solute and the solvent, and the amount of solvent required to break the azeotrope in the case of an azeotropic feed mixture. Another important constraint is that the solvent should be miscible with the mixture in the desired concentration range. This constraint is assessed with the phase stability criterion proposed by Michelsen [21]. Furthermore, the feed concentration should be considered in selecting the component that should be removed from the top of the extractive distillation column. This choice determines the nature of the solvents to be generated, with the purpose of increasing or decreasing the relative volatility of the feed mixture. The CAMD procedure estimates the solvent properties from activity coefficients and pure component properties, both obtained from group contribution methods based on UNIFAC groups. The computation of the desired properties on the basis of these estimates is given in Table 7.


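Table 7. MOLDES Property Estimates for Solvent Evaluation for Extractive Distillation

Property                                                             Estimate
Relative Volatility                                                  $\alpha_{B,A} = \dfrac{\gamma^{\infty}_{B,S} P^{0}_{B}}{\gamma^{\infty}_{A,S} P^{0}_{A}}$
Solvent Power (mass basis)                                           $S_p = \dfrac{1}{\gamma^{\infty}_{A,S}} \dfrac{MW_A}{MW_S}$
Minimum amount of solvent to break the azeotrope (molar fraction)    $x_{ms}$: $[\alpha_{B,A}]_{x_{ms}} = 1$
Phase Stability Criterion                                            $\gamma_{S,\mathrm{azeotrope}}\, x_{ms} \le 1.0$
Performance Index                                                    $PI = \dfrac{\alpha_{B,A}}{MW_S\, x_{ms}}$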

2.3 APPLICATION EXAMPLES

Application of a CAMD method based on the generate & test approach is highlighted through two examples involving solvent-based separations.

2.3.1 Solvent for ethanol recovery

The recovery of ethanol from aqueous solutions is a problem of great industrial interest. Ethanol recovery and dehydration by distillation and azeotropic distillation are very energy intensive. The potential of liquid extraction for this application can be readily explored by CAMD. The search for a potential solvent illustrates the effect of physical property constraints on solvent selection. The solvent properties desired for this application are:

$\beta > 7.0$ (wt./wt.)

$T_{bS} - T_{bA} > 50$ K

$m > 1.0$ (wt./wt.)

$S_l < 0.1$ wt.%
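
As a minimal sketch of how these four specifications act as the "test" part of a generate & test procedure, the check below assumes that the property estimates of a candidate are already available. The function name, the argument names and the sample values are hypothetical and are not taken from MOLDES; only the four thresholds come from the list above, and 351.4 K is the normal boiling point of ethanol.

```python
def meets_extraction_specs(beta, tb_solvent, tb_solute, m, solvent_loss_wt_pct):
    """Return True if a candidate passes all four specifications listed above."""
    return (
        beta > 7.0                           # selectivity, wt./wt.
        and tb_solvent - tb_solute > 50.0    # boiling point difference, K
        and m > 1.0                          # distribution coefficient, wt./wt.
        and solvent_loss_wt_pct < 0.1        # solvent loss, wt.%
    )

# Hypothetical property estimates for two candidates (illustrative numbers only)
print(meets_extraction_specs(8.5, 450.0, 351.4, 1.2, 0.05))  # passes every test
print(meets_extraction_specs(8.5, 450.0, 351.4, 0.6, 0.05))  # rejected on m
```

A candidate is rejected as soon as one specification fails, which is why conflicting requirements (such as those on m and on solvent loss) can leave the feasible set empty.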

Molecular design results for several homologous families of organic solvents are shown in Table 8. The low selectivities of alkyl amines and diols exclude all the members of these families as potential solvents. Even though all families satisfy the boiling point difference, the requirement of distribution coefficients greater than one rejects all solvents with MW greater than 100. However, the solvent loss restriction requires higher molecular weights (>140, more CH2 groups) for the alcohol and carboxylic acid families; therefore no solvent that meets all the specifications can be found. Molecular design thus excludes liquid extraction as a feasible operation for this particular problem.

Table 8. Effect of solvent property constraints on ethanol extraction from aqueous solution solvent molecular design

Solvent Family      $\beta > 7.0$ (wt./wt.)   $T_{bS} - T_{bA} > 50$ K   $m > 1.0$ (wt./wt.)              $S_l < 0.1$ wt.%
Phenyl acids        (+)                       (+)                        (-)
Alcohols            (+)                       (+)                        (+) if MW<100, (-) if MW>100     (-) if MW<140
Carboxylic acids    (+)                       (+)                        (+) if MW<100, (-) if MW>100     (-) if MW<140
Diols               (-)
Alkyl amines        (-)

In the synthesis of linear paraffinic solvents for the extraction of ethanol from water, using the following 13 groups: (C5H3N) (CH2CO) (CH2COO) (CH2O) (CH2NH) (CH2) (OH) (CH3COO) (CH3O) (C5H4N) (COOH) (CH2NH2) (CH3), the molecular design program selects 1050 intermediate structures and 213 final structures and generates 99 final solvents for which the information on binary coefficients is available. However, as mentioned before, no liquid solvent met all the primary properties required. The design of solvents for the recovery of other oxychemicals from aqueous solutions, such as furfural, butanol, and propanoic and acrylic acids, is successfully accomplished by molecular design, and the results agree with experimental results for these systems [8].

2.3.2 Solvent for separation of n-propyl acetate from n-propyl alcohol by extractive distillation

The separation of n-propyl acetate from n-propyl alcohol serves to illustrate the application of MOLDES for the synthesis of potential solvents. The solvent should exhibit the following properties:

$\alpha_{B,A} \ge 3.0$

$S_p \ge 30.0$ wt.%     (7)

$T_{bS} - T_{bA} > 50$ K

The best solvents found by MOLDES are shown in Table 9, together with experimental relative volatility values obtained by Cepeda and Resa (1984).

Table 9. CAMD solvent selection for the extractive distillation of n-Propyl Acetate from n-Propyl Alcohol at atmospheric pressure

Solvent            $\alpha_{B,A}$   $\alpha_{B,A,\mathrm{exp}}$   $S_p$    $x_{ms}$   PI%
Ethylbenzene       5.4              4.23                          80.6     35.9       14.2
Nonene             4.64                                           46.45    35.7       10.31
n-Decane           5.26             4.63                          35.7     37.6       9.84
Chlorobenzene      3.71             4.7                           86.0     33.7       9.78
Decalin            4.64             4.63                          42.6     34.1       9.71
Chlorooctane       4.95                                           47.11    34.8       9.57
Xylene             3.95             4.37                          67.1     40.9       9.1
Dichlorobenzene    3.1              4.79                          60.93    30.6       6.89
Mesitylene         2.32             4.24                          35.7     47.8       4.05
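
The performance index column of Table 9 can be cross-checked against the other columns. The sketch below assumes that the performance index of Table 7 reads PI = alpha_B,A / (MW_S x_ms), with x_ms taken as a fraction; this reading is an assumption, although it reproduces the tabulated values closely. The molecular weights are standard values and are not given in the table, and the function name is hypothetical.

```python
# Molecular weights (g/mol) are standard values; the other numbers are from Table 9.
rows = {
    # solvent: (MW_S, alpha_BA, x_ms in %, PI% as tabulated)
    "ethylbenzene":  (106.17, 5.40, 35.9, 14.2),
    "nonene":        (126.24, 4.64, 35.7, 10.31),
    "n-decane":      (142.28, 5.26, 37.6, 9.84),
    "chlorobenzene": (112.56, 3.71, 33.7, 9.78),
}

def performance_index_pct(mw_s, alpha, x_ms_pct):
    """PI in percent, assuming PI = alpha / (MW_S * x_ms) with x_ms as a fraction."""
    return 100.0 * alpha / (mw_s * x_ms_pct / 100.0)

for name, (mw, alpha, x_ms, pi_tab) in rows.items():
    pi_calc = performance_index_pct(mw, alpha, x_ms)
    print(f"{name:14s} PI% = {pi_calc:5.2f} (Table 9: {pi_tab})")
```

Under this reading the index rewards a high relative volatility obtained with a small amount of a low molecular weight solvent.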

For this separation problem ethylbenzene, nonene, n-decane and xylenes are the most attractive solvents. From their experimental study, Cepeda and Resa recommended the use of xylenes and saturated hydrocarbons with more than 9 atoms. If the reverse problem is studied, that is, if propyl alcohol is the solute of the extractive distillation column and is removed from the bottom together with the solvent, the selection changes drastically and the best solvents are ethylene glycol or propylene glycol (Pretel et al. [8]).

2.4 REFERENCES

1. R. Gani and E.A. Brignole, Fluid Phase Equilibria 13 (1983) 331.
2. E.A. Brignole, S. Bottini, R. Gani, Fluid Phase Equilibria 29 (1986) 125.
3. Aa. Fredenslund, J. Gmehling and P. Rasmussen, "Vapor liquid equilibria using UNIFAC", Elsevier Scientific, Amsterdam, 1977.
4. V. Venkatasubramanian, K. Chan, J.M. Caruthers, Computers Chem. Eng. 18 (1994) 833.
5. K.G. Joback and G. Stephanopoulos, "Designing molecules possessing desired physical property values", Proceedings FOCAPD'89, Snowmass, CO, 1989.
6. N. Churi, L.E.K. Achenie, Ind. Eng. Chem. Res. 35 (1996) 3788.
7. P.M. Harper, R. Gani, P. Kolar, T. Ishikawa, Fluid Phase Equilibria 158-160 (1999) 337.
8. E.J. Pretel, P. Araya López, S.B. Bottini, E.A. Brignole, AIChE Journal 40 (1994) 1349.
9. R. Gani, B. Nielsen, Aa. Fredenslund, AIChE J. 37 (1991) 1318.
10. O. Odele, S. Machietto, Fluid Phase Equilibria 82 (1993) 47.
11. Aa. Fredenslund, J. Sorensen, Ch. 4, "Group Contribution Methods" in "Models for Thermodynamic and Phase Equilibria Calculations", editor S.I. Sandler, Marcel Dekker, Inc., New York, 1994.
12. S. Skjold-Jorgensen, Ind. Eng. Chem. Res. 27 (1988) 110.
13. H.P. Gros, S. Bottini, E.A. Brignole, Fluid Phase Equilibria 116 (1996) 537.
14. S. Espinosa, G. Foco, A. Bermúdez, T. Fornari, Fluid Phase Equilibria 172 (2000) 129.
15. R.C. Reid, J.M. Prausnitz, B.E. Poling, "The properties of gases and liquids", 4th Ed., McGraw-Hill Inc., New York, 1987.
16. E. Pretel, P. Lopez, A. Mengarelli, E. Brignole, Latin American Applied Res. 22 (1992) 187.
17. L. Constantinou, R. Gani, AIChE J. 40 (1994) 1697.
18. M. Cismondi, E.A. Brignole, Proceedings of the 11th European Symposium on Computer Aided Process Engineering, Denmark, May 2001, edited by R. Gani and S. Bay Jorgensen, Elsevier, ISBN: 0-444-50709-4, pp. 375-380.
19. M. Cockrem, J. Flatt and E. Lightfoot, Sep. Sci. and Technol. 24 (1989) 769.
20. E. Cepeda, J.M. Resa, An. Quím. 80 (1984) 755.

Computer Aided Molecular Design: Theory and Practice. L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors). © 2003 Elsevier Science B.V. All rights reserved.

Chapter 3: Optimization Methods in CAMD - I

M. Sinha, L. E. K. Achenie & G. M. Ostrovsky

3.1 INTRODUCTION

Chemical product design addresses the design of single component chemical compounds and/or mixtures (blends) of compounds with pre-specified thermo-physical properties. In recent years, traditional wet-chemistry-based chemical product design has been supplemented with computer-aided approaches. The latter is formally designated as computer-aided product design. To be consistent with this book, we will employ the more conventional name, computer-aided molecular design (CAMD), in this chapter. The CAMD problem can often be posed as a mathematical program in which a number of binary and continuous variables define the search space (Duvedi and Achenie, 1996; Churi and Achenie, 1996; Maranas, 1997; Odele and Machietto, 1993; Pistikopoulos and Stefanis, 1998). A binary variable is an integer variable that can take one of two possible values, for example 0 and 1. This chapter discusses a branch and bound approach to solving the resulting mathematical program.

3.2 PROBLEM DEFINITION

A typical molecular design problem may be modeled as a single objective minimization or maximization subject to structural and performance constraints. Thus a CAMD problem for single component molecular design in which thermo-physical property matching is sought may be modeled as

$$\min_{x,v,\theta} f(x,v,\theta) \qquad (1)$$

$$\varphi_j(x,v,\theta) \le 0, \quad j = 1,\ldots,m_1 \qquad (2)$$

$$h_i(x,v,\theta) = 0, \quad i = 1,\ldots,m_2 \qquad (3)$$

where $v$ is a vector of binary variables that define the molecular structure, $x$ is a vector of continuous variables such as process variables (pressure, temperature, etc.) and $\theta$ is a vector of group contribution parameters. Note that additional binary variables may be included in $v$ to indicate additional constraints on the kind of molecular structures that can be generated. $f(x,v,\theta)$ is the performance objective function (for example some undesirable property such as a compound's ozone depletion potential). The group contribution model is a structure-property correlation that has found wide use in the chemical process industry.

The constraints involve (a) structural feasibility, (b) physical property targets, and (c) process constraints. The constraints associated with structural feasibility are usually linear. Physical property targets often have the form $p_k^L \le p_k(x,v,\theta) \le p_k^U$. If $p_k(x,v,\theta)$ is modeled using group contribution, then it may have the form $p_k = \sum_j n_j\theta_j^1 / \sum_j n_j\theta_j^2$. Here $\theta_j^1$ and $\theta_j^2$ are elements in $\theta$ and $n_j$ is the number of times the group with contributions $\theta_j^1$ and $\theta_j^2$ is present in the molecule. Transformation of such constraints into a linear form is straightforward. The function $p_k(x,v,\theta)$ can also have the form $p_k = f_1^{NL}\big(\sum_j n_j\theta_j^a\big) / f_2^{NL}\big(\sum_j n_j\theta_j^b\big)$, where $f_1^{NL}$ and $f_2^{NL}$ are nonlinear functions, and $\theta_j^a$ and $\theta_j^b$ are parameters. Property constraints that employ this form include the solubility-parameter-based models often used in solvent design. It is not always possible to reformulate these constraints into linear or convex forms.
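
To make the linear transformation of the ratio form explicit, the following short derivation may help. It rests on an added assumption that is not stated above, namely that the denominator $\sum_j n_j\theta_j^2$ is strictly positive for every feasible molecule; $p_k^L$ and $p_k^U$ denote the lower and upper property targets.

```latex
p_k^L \le \frac{\sum_j n_j\theta_j^1}{\sum_j n_j\theta_j^2} \le p_k^U
\;\Longleftrightarrow\;
\sum_j n_j\left(\theta_j^1 - p_k^L\,\theta_j^2\right) \ge 0
\quad\text{and}\quad
\sum_j n_j\left(\theta_j^1 - p_k^U\,\theta_j^2\right) \le 0 .
```

Both inequalities are linear in the group counts $n_j$. If the denominator can be negative, the inequality directions reverse, so its sign has to be established first.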

The nonlinear mathematical programming model for the CAMD problem (PMD) has the following features: (a) it is a nonconvex mixed integer nonlinear programming (MINLP) problem involving a large number of binary variables, (b) the number of linear constraints is larger than the number of nonlinear constraints, and (c) most of the components of the design vector ($v$) participate in the nonlinear terms. Previous attempts using global optimization are either geared to small problems or use soft computing approaches (such as simulated annealing and genetic algorithms). The approach discussed here is based on the branch and bound (BB) algorithm. The basic BB algorithm may encounter a large number of branching variables for product design problems. To address this, the branch-and-bound global optimization algorithm presented here exploits the problem structure and allows a significant reduction in the number of branching expressions. The discussion of the algorithm is based on the papers (Sinha, Achenie and Ostrovsky, 1999) and (Ostrovsky, Achenie and Sinha, 2000).

In group contribution based computer aided single component product design, solvents are formed from certain combinations of a set of structural groups. The pre-specified set of m structural groups is called the basis set. The size and composition of the basis set depends on the intended application, the availability of accurate property prediction models and the computational resources available. First, we define a set of variables based on an initial set of structural groups as

$$u_{ik} = \begin{cases} 1 & \text{if the } i\text{-th group in the molecule is the } k\text{-th structural group in the basis set} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Churi-Achenie model)}$$

$$u_i = \begin{cases} 1 & \text{if the } i\text{-th structural group in the basis set is in the molecule} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Odele-Machietto model)} \qquad (4)$$

Odele and Machietto (1993) proposed a formulation that ensured that the valence of each structural group was satisfied. This formulation only accounts for the presence or absence of structural groups in the molecule; it does not consider the information that determines how the groups are connected to each other. To overcome this limitation, Churi and Achenie (1996) proposed a model that gives complete information on how the groups are connected to each other. Presently there is no known group contribution method that takes advantage of the connectivity information of the Churi-Achenie model. In the latter model, the following variables were introduced:

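$$z_{ijp} = \begin{cases} 1 & \text{if the } i\text{-th group's } j\text{-th site is attached to the } p\text{-th group} \\ 0 & \text{otherwise} \end{cases}$$

$$w_i = \begin{cases} 1 & \text{if the } i\text{-th group in a molecule does not have a group attached} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$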
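
The difference between the two definitions of u in Eq. (4) can be illustrated with a small sketch. The basis set and the molecule (ethanol) below are hypothetical choices made for illustration, and only the u variables are built; the connectivity variables z and w of Eq. (5) are not constructed here.

```python
basis_set = ["CH3", "CH2", "OH", "COOH"]   # hypothetical basis set, m = 4
molecule = ["CH3", "CH2", "OH"]            # ethanol, listed position by position

# Odele-Machietto style: u[i] = 1 if basis group i appears in the molecule
u_presence = [1 if g in molecule else 0 for g in basis_set]

# Churi-Achenie style: u[i][k] = 1 if position i of the molecule holds basis group k
u_position = [[1 if g == b else 0 for b in basis_set] for g in molecule]

print(u_presence)        # one entry per basis group
for row in u_position:   # one row per position in the molecule
    print(row)
```

The position-indexed matrix has exactly one nonzero entry per occupied position, which is what allows the connectivity variables to refer to specific positions in the molecule.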

For single component solvents, structural constraints are imposed for (a) limiting the number of structural groups in a molecule; (b) ensuring that the number of bonds attached to a group equals the valence of the group; and (c) ensuring that each group in a molecule is attached to at least one other group. The formulation is effective in specifying whether the molecule is acyclic or cyclic. Moreover, the maximum number of cycles can also be controlled. This representation is also effective in distinguishing between isomers. If the chemical process is not accounted for, then the pure component molecular design problem involves only binary variables. The maximum number of groups in a molecule is $n_{max}$; the number of groups in the basis set is $m$, with a maximum valence of $s_{max}$. In this case the search dimension is given by $n_{max} \times m + n_{max} \times s_{max} \times n_{max} + n_{max}$. Here the number of binary variables is equal to the sum of the dimensions of $u$, $z$ and $w$, respectively (assuming the Churi-Achenie model is used). The number of linear structural constraints employed is $n_{max}^2 + n_{max} \times m + 3n_{max} + s_{max} + 1$. For example, a CAMD problem with $n_{max} = 5$, $m = 10$, and $s_{max} = 2$ results in 93 linear constraints. The number of nonlinear constraints is generally small compared to the number of linear constraints. Let all the binary variables in the problem be assembled in the vector $v$ ($q$-dimensional). If the Odele-Machietto model is employed then $v = u$; on the other hand, if the Churi-Achenie model is employed then $v = [u, z, w]$. Then the solvent design problem (see (1), (2), (3)) can be expressed compactly as a mixed integer nonlinear program in the general form

$$P: \quad f = \min_{x,v \in D} f(x,v) \qquad (6)$$

such that

$$D = \{x, v : c \le x \le d,\ \varphi_i(x,v) \le 0,\ i = 1,\ldots,m,\ h(x,v) = 0,\ x \in X \subseteq \Re^n,\ v \in \{0,1\}^q\}$$
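
The size expressions quoted above for the Churi-Achenie model are easy to evaluate. The short sketch below simply codes the two expressions from the text; the function name is hypothetical.

```python
def churi_achenie_sizes(n_max, m, s_max):
    """Binary-variable and linear-constraint counts from the expressions in the text."""
    n_binary = n_max * m + n_max * s_max * n_max + n_max     # dimensions of u, z and w
    n_linear = n_max**2 + n_max * m + 3 * n_max + s_max + 1  # linear structural constraints
    return n_binary, n_linear

print(churi_achenie_sizes(5, 10, 2))  # (105, 93)
```

For the example in the text this gives 105 binary variables alongside the 93 linear constraints, which illustrates why branching on every binary variable quickly becomes expensive.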

3.3. DESCRIPTION OF THE PROPOSED METHOD OF SOLUTION

3.3.1 Branch-and-Bound Algorithm Preliminaries

The branch and bound (BB) method (Horst and Tuy, 1990) has been used for solving several problems in chemical engineering (Ostrovsky et al., 1990; Friedler et al., 1998; Quesada and Grossmann, 1995; Ryoo and Sahinidis, 1996; Maranas and Floudas, 1997; Adjiman et al., 1998). The generic BB method looks for a minimum of the objective function $f(x,v)$ by partitioning the region $D$ into subregions $D_i$ with respect to the search variables. At each iteration, a subregion $D_i$ is further partitioned into $D_{ip}$ and $D_{iq}$ ($D_i = D_{ip} \cup D_{iq}$). The generic BB method consists of the following:

(i) An algorithm for estimating a lower bound (LB) $\mu_i$ on the objective function $f(x,v)$ in any subregion $D_i \subseteq D$ such that $\mu_i \le f(x,v)\ \forall x,v \in D_i$

(ii) An algorithm for estimating an upper bound (UB) $\eta_j$ on $f(x,v)$ in any $D_j \subseteq D$ such that $\eta_j \ge f(x,v)\ \forall x,v \in D_j$

(iii) An algorithm for partitioning $D_i$

Designate the set of subregions at the $k$-th iteration of the BB method as $L^{(k)} = (D_i,\ i = 1,\ldots,N_k)$. Let $I^{(k)}$ be the index set of the subregions belonging to $L^{(k)}$. Then the algorithm for the BB method is as follows

Step 1. Set $k = 1$. Give an initial set $L^{(0)}$ of the subregions $D_i$ ($i = 1,\ldots,N_0$; usually $N_0 = 1$).

Step 2. Calculate an LB for each $D_i \in L^{(k)}$.

Step 3. Determine the subregion with the least LB. Let it be the $l_m$-th region; then

$$\mu_{l_m} = \min_{l \in I^{(k)}} \mu_l \qquad (8)$$

Step 4. Split $D_{l_m}$ into two subregions $D_p$ and $D_q$ ($D_{l_m} = D_p \cup D_q$) such that

$$D_p = \{x : x \in D_{l_m},\ x_s \le c_s\}, \quad D_q = \{x : x \in D_{l_m},\ x_s \ge c_s\}$$

The variable $x_s$ is the branching variable and $c_s$ is the branching point.

Step 5. Determine LB and UB for subregions $p$ and $q$.

Step 6. Determine the least upper bound $\eta^{(k)}$ at the $k$-th iteration:

$$\eta^{(k)} = \min(\eta^{(k-1)}, \eta_p, \eta_q)$$

For the first iteration $\eta^{(0)} = \infty$.

Step 7. If $\eta^{(k)} - \mu_{l_m} < \varepsilon$ then STOP.

Step 8. If

$$\mu_j > \eta^{(k)} \qquad (9)$$

is met for $j = p$ or $j = q$ then the corresponding subregion is eliminated from consideration.

Step 9. Form a new set $L^{(k)}$ of the remaining subregions as follows

$$L^{(k)} = (D_1, \ldots, D_{l_m-1}, D_{l_m+1}, \ldots, D_{N_k}, L)$$

where

$$L = \begin{cases} D_p, D_q & \text{if } \mu_j < \eta^{(k)},\ j = p, q \\ D_q & \text{if } \mu_q < \eta^{(k)} < \mu_p \\ D_p & \text{if } \mu_p < \eta^{(k)} < \mu_q \end{cases}$$

Step 10. Set $k = k+1$, and go to Step 3.
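
The generic steps above can be sketched on a one-dimensional toy problem. Everything specific in this sketch is an assumption made for illustration: the objective function, the interval-based lower bound (which is not the linear estimator developed later in this chapter), midpoint branching, and the use of the midpoint value as the upper bound. Only the loop structure follows the listed steps.

```python
import heapq

def f(x):
    # Toy nonconvex objective, chosen only for illustration
    return x**4 - 3.0 * x**2 + x

def lower_bound(lo, hi):
    """Valid LB on [lo, hi]: sum of the minima of x**4, -3*x**2 and x."""
    sq_min = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    sq_max = max(lo * lo, hi * hi)
    return sq_min**2 - 3.0 * sq_max + lo

def branch_and_bound(lo, hi, eps=1e-4):
    best_x = 0.5 * (lo + hi)
    eta = f(best_x)                                    # incumbent upper bound
    regions = [(lower_bound(lo, hi), lo, hi)]          # Steps 1-2
    while regions:
        mu, a, b = heapq.heappop(regions)              # Step 3: least LB
        if eta - mu < eps:                             # Step 7: bounds have met
            break
        c = 0.5 * (a + b)                              # Step 4: branching point
        for p, q in ((a, c), (c, b)):
            mu_j = lower_bound(p, q)                   # Step 5: LB of the child
            x_j = 0.5 * (p + q)
            if f(x_j) < eta:                           # Steps 5-6: UB from a feasible point
                eta, best_x = f(x_j), x_j
            if mu_j <= eta:                            # Step 8: otherwise eliminate
                heapq.heappush(regions, (mu_j, p, q))  # Step 9: keep the child
    return best_x, eta

x_best, f_best = branch_and_bound(-2.0, 2.0)
print(round(x_best, 2), round(f_best, 3))
```

A priority queue keyed on the lower bound is one convenient way to implement Step 3; tighter lower bounds would eliminate subregions earlier and reduce the number of regions visited.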

Each BB method needs to develop algorithms for partitioning and for estimating lower and upper bounds. We therefore describe the algorithms we have developed for estimating lower and upper bounds for the mixed integer nonlinear program arising from our formulation of the computer aided molecular design problem. Let us first consider the partitioning algorithm. At each iteration in a standard BB method, the "optimal" subregion $D_{l_m}$ is partitioned into two subregions $D_p$ and $D_q$ using the constraints $x_i \le x_i^*$ and $x_i \ge x_i^*$, or $v_i \le v_i^*$ and $v_i \ge v_i^*$, as follows

$$D_p = \{x : x \in D_{l_m},\ x_s \le c_s\}, \quad D_q = \{x : x \in D_{l_m},\ x_s \ge c_s\}$$

The variable $x_s$ is the branching variable and $c_s$ is the branching point. Different BB methods have different ways of selecting these. Thus in this case $n+q$ branching variables are used. In a realistic product design problem, the number of branching variables can be several hundred. It is known that the number of branching nodes grows exponentially. To alleviate this problem, we will use the following new partition algorithm. Instead of branching on the variables $(x,v)$, we will use appropriate functions $\psi_j(x,v)$, $j = 1,\ldots,p$, of the search variables for branching. Subsequently, $D_i$ will be determined by the set of inequalities

$$a_j^i \le \psi_j(x,v) \le b_j^i, \quad j = 1,\ldots,p,$$

where the lower and upper bounds $a_j^i$ and $b_j^i$, which are the dimensions of the multidimensional box (subregion) $D_i$, are determined by the branch-and-bound strategy. Thus $D_i$ has the form

$$D_i = \{x, v : x, v \in D,\ a_j^i \le \psi_j(x,v) \le b_j^i,\ j = 1,\ldots,p\}$$

Problem P for subregion $D_i$ is written as

$$P_i: \quad f_i = \min_{x,v \in D_i} f(x,v) \qquad (10)$$

A direct solution of the above problem is very difficult. Instead, the approach to be described finds the solution indirectly by successively estimating lower and upper bounds for the performance objective function $f_i$. In the limit, these bounds should collapse into one to give a solution to the above problem. Thus it is appropriate to discuss how these bounds are obtained.

3.3.2 Lower Bound Algorithm

A lower bound $f_i^L$ for $f_i$ on $D_i$ is obtained by solving the following problem

$$P_i^L: \quad f_i^L = \min_{x,v \in \hat{D}_i} L[f(x,v); D_i]$$

where

$$\hat{D}_i = \{x, v : L[\varphi_k; D_i] \le 0,\ k = 1,\ldots,m;\ L[\psi_j; D_i] \le b_j^i;\ L[-\psi_j; D_i] \le -a_j^i;\ v \in \{0,1\}^q\}$$

and $L[g(x,v); D_i]$ is a convex underestimator for the generic function $g(x,v)$. Then it is easy to verify that $D_i \subset \hat{D}_i$.

Some alternatives for estimating lower bounds are: (a) the use of linear or convex nonlinear underestimators; (b) enforcing the integrality of all the binary variables $v$ at each iteration (Pantelides, 1996); (c) treating the variables $v$ as continuous variables such that $0 \le v \le 1$. In the latter case, the variables become binary only at termination of the algorithm. We will construct linear underestimators and we will enforce integrality of $v$ at each iteration as in (b). The resulting problem ($P_i^L$) is a mixed integer linear program (MILP).

3.3.3 Upper Bound Algorithm

The upper bound $f_i^U$ for $f_i$ on $D_i$ can be found by computing $f_i^U = f(\bar{x}, \bar{v})$, where $[\bar{x}, \bar{v}]$ is a feasible point for problem (10). The latter can be obtained by solving

$$\gamma^* = \min_{x,v,\gamma} \gamma$$

$$\bar{\varphi}_j(x,v) \le \gamma, \quad j = 1,\ldots,(m+2p) \qquad (11)$$

where

$$\bar{\varphi}_j = \begin{cases} \varphi_j, & j = 1,\ldots,m \\ -\psi_{j-m} + a_{j-m}^i, & j = (m+1),\ldots,(m+p) \\ -b_{j-(m+p)}^i + \psi_{j-(m+p)}, & j = (m+1+p),\ldots,(m+2p) \end{cases}$$

This is a nonconvex problem and therefore computationally intensive to solve at each iteration. To circumvent this, we obtain an upper estimate $\bar{\gamma}$ of the value $\gamma^*$ by solving the problem

$$P^U: \quad \bar{\gamma} = \min_{x,v,\gamma} \gamma$$

$$U[\bar{\varphi}_j(x,v); D_i] \le \gamma, \quad j = 1,\ldots,(m+2p)$$

where $U[\bar{\varphi}_j(x,v); D_i]$ is a linear overestimator of $\bar{\varphi}_j(x,v)$ on $D_i$ such that $U[\bar{\varphi}_j(x,v); D_i] \ge \bar{\varphi}_j(x,v),\ \forall (x,v) \in D_i$. Let $\bar{D}_i = \{x, v : U[\bar{\varphi}_j(x,v); D_i] \le 0\}$; then $\bar{D}_i \subset D_i$ and $\bar{\gamma} \ge \gamma^*$. Problem $P^U$ is an MILP. It should be noted that we could terminate the solution of $P^U$ whenever $\gamma \le 0$.

During evaluation of the lower and upper bounds for subregion $D_i$, the following situations may arise at the $k$-th step of the branch-and-bound algorithm, where $\hat{D}_i$ is the feasible region of the lower bound problem and $\bar{\gamma}$ is the solution of problem $P^U$: (i) $\hat{D}_i \ne \emptyset$, $\bar{\gamma} \le 0$; (ii) $\hat{D}_i \ne \emptyset$, $\bar{\gamma} > 0$; and (iii) $\hat{D}_i = \emptyset$. In (i), we can calculate both the lower and upper bounds, while in (ii) we can only calculate the lower bound, since we cannot ensure that the point obtained by solving problem $P^U$ will be feasible for problem (10). Finally, in (iii), $D_i$ does not contain solution points and consequently it can be excluded from consideration. The branching point $\psi^* = \psi(x^*, v^*)$ is determined at the solution point of the lower bound problem $P^L$.

3.3.4 Linear Estimators and Branching Functions

The main challenge in a BB based method is the construction of underestimators and overestimators. McCormick (1976) suggested the factorable programming technique for constructing a convex underestimator for a function represented in factorable form. Sherali and Alameddine (1992) suggested a general approach for constructing underestimators for arbitrary polynomial functions. A method for the construction of underestimators for more general functions is proposed in the αBB global optimization method (Adjiman et al., 1998). In all the above approaches, the dimension of the lower bound problem can be much larger than the dimension of the original problem. Here we present an alternative approach in which the dimension of the lower bound problem is not greater than that of the original problem.

Let us consider a class of functions $\varphi_i$ that can be represented as a tree graph (Fig. 1). Denote the root node of the graph as $A_1^N$. The nodes $A_j^{N-k}$, which are $k$ branches apart from the root node, are at the $(N-k)$-th level. Let the $k$-th level of the tree graph have $p_k$ nodes. Each node $A_j^{N-k}$ has $q_j^{N-k}$ descendants. Assign a differentiable function $\varphi_j^{(N-k)}$ of many variables and $q_j^{(N-k)}$ continuously differentiable functions $f_{ji}^{(N-k)}(y)$ of one variable $y$ to each node $A_j^{N-k}$ of the $(N-k)$-th level ($k = 1,\ldots,N-1$). The original function $\varphi_i$ corresponds to the root node. Thus the following relations hold

$$\varphi_j^{(N-k)} = \sum_{i \in Q_j^{N-k}} c_{ji}^{(N-k)} f_{ji}^{(N-k)}\big(\varphi_i^{(N-k-1)}\big) \qquad (12)$$

Figure 1: A multilevel representation of a tree

where $Q_j^{(N-k)}$ is the set of descendant nodes of $A_j^{(N-k)}$. The variable $x_i$ corresponds to a leaf node. Without loss of generality, we will assume that the leaf nodes are associated with the first level. Otherwise we employ identity transformations to relate the variable $x_i$ to the first level. Suppose for example the variable $x_i$ is associated with the second level. Then we can introduce the transformation $\varphi_i^{(2)} = x_i$. In so doing we have related $x_i$ to the first level as well.

A function $f(x)$ is defined as a special tree function (STF) if each node of the computational graph corresponding to it is characterized by relation (12). Thus the STF is a superposition of univariate concave or convex functions connected by simple arithmetic operations, namely addition, subtraction, multiplication by a constant coefficient, and operations corresponding to the univariate functions $f_{ji}^{(N-k)}(y)$ in intermediate nodes. There exist different ways of transforming a tree function into an STF. The simplest way is to use the following transformation for removing the multiplication operation.

$$f(x)g(x) = \tfrac{1}{4}\big(f(x) + g(x)\big)^2 - \tfrac{1}{4}\big(f(x) - g(x)\big)^2$$
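
The transformation can be checked numerically. The sketch below evaluates the right-hand side for binary pairs (as in a bilinear term of structural variables) and for a few arbitrary real pairs; the function name and the sample points are hypothetical.

```python
import itertools

def product_as_stf(f_val, g_val):
    """Right-hand side of the transformation: a difference of two squares."""
    return 0.25 * (f_val + g_val) ** 2 - 0.25 * (f_val - g_val) ** 2

# Binary pairs, as for a bilinear term u1*u2 of structural variables
for u1, u2 in itertools.product((0, 1), repeat=2):
    assert product_as_stf(u1, u2) == u1 * u2

# A few real-valued pairs (arbitrary sample points)
for a, b in [(0.5, -2.0), (3.0, 1.5), (-1.25, -4.0)]:
    assert abs(product_as_stf(a, b) - a * b) < 1e-12

print("identity holds on all sample points")
```

Each square is a univariate convex function of a linear argument, which is exactly the structure the STF requires.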

We propose a strategy for constructing a linear underestimator for the function $\varphi_1^{(N)}$ corresponding to the root node $A_1^N$. Note that $\varphi_1^{(N)}$ is a complex multilevel function of the variables $x_1,\ldots,x_n$ at the first level. We will assume that all the coefficients $c_{ji}^{(N-k)}$ are positive. If a coefficient $c_{ji}^{(N-k)}$ is negative we can introduce the new notations $\bar{c}_{ji}^{(N-k)} = -c_{ji}^{(N-k)}$ and $\bar{f}_{ji}^{(N-k)} = -f_{ji}^{(N-k)}$, and replace $c_{ji}^{(N-k)} f_{ji}^{(N-k)}$ by $\bar{c}_{ji}^{(N-k)} \bar{f}_{ji}^{(N-k)}$. Here $\bar{c}_{ji}^{(N-k)} > 0$.

Let

$$\varphi_j^{(N-k-1)} \in S_j^{(N-k-1)}$$

where

$$S_j^{(N-k-1)} = \{\varphi_j^{(N-k-1)} : \underline{\varphi}_j^{(N-k-1)} \le \varphi_j^{(N-k-1)} \le \overline{\varphi}_j^{(N-k-1)}\} \qquad (13)$$

If we know the bounds for the variables $x_i$ at the first level, we can estimate bounds for all functions $\varphi_j^{(N-k)}$ (at all levels) by using interval arithmetic (Moore, 1966). A linear underestimator of the function $\varphi_i^{(N-k)}$ in the region $S_i^{(N-k)}$ with respect to the functions $\varphi_j^{(N-k-1)}$, $j \in Q_i^{N-k}$, will be designated as $L[\varphi_i^{(N-k)}; S_i^{(N-k)}]$.
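
A minimal sketch of the bound propagation by interval arithmetic is given below. The `Interval` class, its method names and the three-level example (the STF form of a product of two relaxed binary variables) are hypothetical illustrations; only the idea of propagating bounds from the first level upwards comes from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def scale(self, c):
        a, b = c * self.lo, c * self.hi
        return Interval(min(a, b), max(a, b))

    def square(self):
        lo = 0.0 if self.lo <= 0.0 <= self.hi else min(self.lo**2, self.hi**2)
        return Interval(lo, max(self.lo**2, self.hi**2))

# First level: two relaxed binary variables, 0 <= u <= 1
u1, u2 = Interval(0.0, 1.0), Interval(0.0, 1.0)
s, d = u1 + u2, u1 - u2                                  # second level
phi = s.square().scale(0.25) - d.square().scale(0.25)    # third level: STF of u1*u2
print(s, d, phi)
```

The bounds obtained at the top level are valid but not necessarily tight (here the product itself lies in [0, 1]), because interval arithmetic ignores the dependence between the two squares.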

One can find a linear relation between $L[\varphi_i^{(N-k)}; S_i^{(N-k)}]$ and the linear underestimators of $\varphi_j^{(N-k-1)}$ at the descendant nodes $A_j^{(N-k-1)}$ as

$$L[\varphi_i^{(N-k)}; S_i^{(N-k)}] = \sum_{j \in Q_i^{N-k}} c_{ij}^{(N-k)} L\big[f_{ij}^{(N-k)}(\varphi_j^{(N-k-1)}); S_j^{(N-k-1)}\big] \le \sum_{j \in Q_i^{N-k}} c_{ij}^{(N-k)} f_{ij}^{(N-k)}\big(\varphi_j^{(N-k-1)}\big) \qquad (14)$$

Now we will construct a linear underestimator for the function $f_{ij}^{(N-k)}(\varphi_j^{(N-k-1)})$ at the $(N-k)$-th level with respect to $\varphi_j^{(N-k-1)}$. Let the latter satisfy Eq. (13). To simplify the notation for subsequent developments, let $y = \varphi_j^{(N-k-1)}$ and consider the function $f(y)$ in the region $S_y = \{y : \underline{y} \le y \le \overline{y}\}$. If $f(y)$ is concave then in $S_y$ the linear underestimator has the form

$$L[f(y); S_y] = f(\underline{y}) + \frac{f(\overline{y}) - f(\underline{y})}{\overline{y} - \underline{y}}\,(y - \underline{y}) \qquad (15)$$

If instead $f(y)$ is convex, then a linear underestimator is given by the following formula. In this case the underestimator is tangent to $f(y)$ at the point $y_m = (\underline{y} + \overline{y})/2$:

$$L[f(y); S_y] = f'(y_m)(y - y_m) + f(y_m) \qquad (16)$$

Here $f'(y_m)$ is the derivative of the function $f(y)$ at the point $y_m$.
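
The two univariate underestimators of Eqs. (15) and (16) can be sketched as follows. The helper names and the test functions (the logarithm as a concave example, the exponential as a convex one) are hypothetical choices made for illustration.

```python
import math

def secant_underestimator(f, y_lo, y_hi):
    """Eq. (15): chord through the end points; underestimates a concave f."""
    slope = (f(y_hi) - f(y_lo)) / (y_hi - y_lo)
    return lambda y: f(y_lo) + slope * (y - y_lo)

def tangent_underestimator(f, df, y_lo, y_hi):
    """Eq. (16): tangent at the midpoint; underestimates a convex f."""
    y_m = 0.5 * (y_lo + y_hi)
    return lambda y: df(y_m) * (y - y_m) + f(y_m)

# Illustrative test functions: log is concave on [1, 4], exp is convex on [0, 2]
L_log = secant_underestimator(math.log, 1.0, 4.0)
L_exp = tangent_underestimator(math.exp, math.exp, 0.0, 2.0)

grid_log = [1.0 + 3.0 * i / 300.0 for i in range(301)]
grid_exp = [2.0 * i / 200.0 for i in range(201)]
gap_log = max(math.log(y) - L_log(y) for y in grid_log)   # largest gap on [1, 4]
gap_exp = max(math.exp(y) - L_exp(y) for y in grid_exp)   # largest gap on [0, 2]
print(gap_log, gap_exp)
```

Both gaps shrink as the interval is narrowed, which is the property the branching strategy relies on.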

Substituting the expressions for the linear underestimators of $f_{ij}^{(N-k)}$ in $\varphi_i^{(N-k)}$ we obtain

$$L[\varphi_i^{(N-k)}; S_i^{(N-k)}] = \sum_{j \in Q_i^{N-k}} d_j \varphi_j^{(N-k-1)} \qquad (17)$$

Again we will assume that $d_j > 0$; otherwise we can employ the transformation discussed earlier. Hence we finally obtain the following expression for the linear underestimator of $\varphi_i^{(N-k)}$:

$$L[\varphi_i^{(N-k)}; S_i^{(N-k)}] = \sum_{j \in Q_i^{N-k}} d_j L[\varphi_j^{(N-k-1)}; S_j^{(N-k-1)}] \qquad (18)$$

At the $(N-k-1)$-th level, we need to know the sign of $d_j$, which is determined at the upper level, $(N-k)$. Therefore, starting from the $N$-th level and moving down to the 2nd level, we obtain all the relations expressed in Eq. (18) for $k = 0, 1, \ldots, N-1$. A linear underestimator for the function $\varphi_1^{(N)}$ can then be represented as a linear function of the variables $x_i$ (associated with the first level) as follows

$$L[\varphi_1^{(N)}; S_1^{(N)}] = \sum_{j=1}^{N} c_j x_j \qquad (19)$$

From the above considerations, the following algorithm for the construction of a linear underestimator for a tree function follows. Summarizing, construction of a linear underestimator involves:

1. A bottom to top sweep to obtain all bounds $[\underline{\varphi}_i^{k}, \overline{\varphi}_i^{k}]$, $\forall k = 1,\ldots,N$ and $i = 1,\ldots,p_k$
2. A top to bottom sweep to obtain the relations (Eq. (18)) for all levels
3. A bottom to top sweep to obtain $L[\varphi_i^{(N-k)}; S_i^{(N-k)}]$ as linear functions of $x$ and $v$.

We will refer to this method as the sweep method. A similar procedure can be used for the construction of linear overestimators. It is important that the underestimator is a linear function of the variables $x$ and $v$. We also note that the dimension of the lower bound problem $P^L$ is the same as the dimension of the original problem P.

3.3.5. Selection of Branching Function

In a conventional BB method, the branching variables are the search variables $x_i$. However, the large dimensionality of $x_i$ ($i = 1,\ldots,n$) can result in a rapid growth in the number of branches in the BB tree. To address this problem, we consider an alternative selection of the branching expressions: we employ the arguments $\varphi_j^{(N-k-1)}$ of all the functions $f_{ij}^{(N-k)}$ as branching variables. Branching on $\varphi_j^{(N-k-1)}$ will decrease the intervals described by (13). Therefore, tighter linear underestimators of $f_{ij}^{(N-k)}$ will be obtained, since $\max_{x,v}\big(f_{ij}^{(N-k)} - L[f_{ij}^{(N-k)}; S_j^{(N-k-1)}]\big)$ will tend to zero as the size of $S_j^{(N-k-1)}$ tends to zero. Only independent functions $\varphi_j$ can be used as branching functions. The suggested approach to the selection of branching expressions will be advantageous if the number of independent functions among the functions $\varphi_j^{(N-k-1)}$ is less than the number of variables $x_i$ ($i = 1,\ldots,n$). In our formulation of the molecular design problem, this is indeed the case.

3.4 STEP BY STEP ALGORITHM FOR SOLUTION TECHNIQUE

Step 1: Decide on the set of groups to be used to form compounds. Identify the design variables. The first set of design variables is v. The second set of design variables is x.

Step 2: Develop the performance objective $f$ (such that it has the structure of Eq. (1)). Ensure that the performance objective can be calculated directly or indirectly using a group contribution property model. Also make sure one or more of the design variables affects the performance objective directly or indirectly.

Step 3: Develop the property constraints (such that they have the structure of Eq. (2)). Ensure that these constraints can be calculated directly or indirectly using a group contribution model. Also make sure one or more of the design variables affects each constraint directly or indirectly.

Step 4: Develop structural feasibility constraints (i.e. an Octet Rule model such that it has the structure of Eq. (2)). Examples are the Odele-Machietto Octet Rule model (1993) and the Churi-Achenie Octet Rule model (1996).

Step 5: Using information from previous steps, assemble the mathematical program, i.e. the performance objective, constraints, design variables and the Octet Rule Model.

Step 6: Construct linear estimators of the performance objective and the constraints by the Sweep method (see Section 3.3.4 and the illustration in Section 3.6)

Step 7: Find optimal structure of the molecule by using the BB method from Section 3.3.1.

3.5 METHODS AND TOOLS

To use the solution technique, you will need the following:

(a) Group contribution based property estimation methods to calculate the needed physical properties. More details on these methods are given in Chapter 1 of this book.

(b) An MILP (mixed integer linear program) solver to be used in Step 7. MILP solvers are available in commercial software such as CPLEX (www.cplex.com) and OSL (www.research.ibm.com/osl). The public domain code lp_solve by Hartmut Schwab (ftp.es.ele.tue.nl/pub/lp_solve) can also be used.

(c) An implementation of the branch and bound procedure from Section 3.3.1.

3.6 APPLICATION EXAMPLE

To illustrate the CAMD design procedure using the branch and bound (BB) method, we use a simple example. Note that this is not a CAMD problem and yet it has the structure of a CAMD problem. A CAMD example is found in Chapter 10.

Step 1 of Section 3.4

Here assume that the structural groups in the basis set have been given and that they have been labeled 1 through 4, the number of structural groups. Assume that the Odele-Machietto model is employed; then $v = u$. Therefore the first set of design variables is $u = [u_1, u_2, u_3, u_4]$. For example, if $u_1 = 1$, then structural group 1 is present in the molecule; otherwise $u_1 = 0$ and structural group 1 is absent. The second set of design variables is $x = [x_1, x_2]$, representing for example two properties.

Step 2 of Sec t ion 3.4

Suppose tha t the performance objective function is given by

f ( u , x ) = alu ~ + a 2 u 2 + a3u 3 + a 4 u 4 + asx ~ + a6x 2 (20)

Steps 3 and 4 of Section 3.4

Suppose also that the molecular structural constraints and the property constraints are given by

$$\varphi_0(u) \equiv u_1 + \dots + u_4 - a \le 0 \tag{21}$$

$$\varphi_1(u,x) \equiv a_{11} u_1 + \dots + a_{14} u_4 + a_{15} u_1 u_2 + a_{16} x_1 + c_1 \le 0 \tag{22}$$

$$\varphi_2(u,x) \equiv a_{21} u_1 + \dots + a_{24} u_4 + a_{25} u_3 u_4 + a_{26} x_1 + c_2 \le 0 \tag{23}$$

Here the $u_i$ are binary variables (for a real CAMD problem each would represent the presence or absence of structural group number $i$ from the basis set, Step 1 of Section 3.4). In addition, the $x_i$ are continuous variables (for a real CAMD problem each would represent a property of interest) and the $a_{ij}$ are known constants that appear in the model.


The resulting CAMD model is

$$\min_{x,u} f(u,x) \quad \text{s.t.} \quad \varphi_i(u,x) \le 0,\ i = 0,1,2, \quad u \in \{0,1\}, \quad x^L \le x \le x^U \tag{24}$$


In the BB method we need to construct linear underestimators for nonlinear constraints and find branching functions. For this we must construct the special tree functions (STF) for nonlinear constraints (22) and (23) (since they contain bilinear terms). Using the transformation in (12) we obtain the STF for the constraints as follows

$$\varphi_1(u,x) \equiv a_{11} u_1 + \dots + a_{14} u_4 + 0.25\,a_{15}(u_1 + u_2)^2 - 0.25\,a_{15}(u_1 - u_2)^2 + a_{16} x_1 + c_1 \le 0 \tag{25}$$

$$\varphi_2(u,x) \equiv a_{21} u_1 + \dots + a_{24} u_4 + 0.25\,a_{25}(u_3 + u_4)^2 - 0.25\,a_{25}(u_3 - u_4)^2 + a_{26} x_1 + c_2 \le 0 \tag{26}$$
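The rewriting of the bilinear terms in (25) and (26) can be checked numerically. The sketch below assumes that transformation (12), which is not reproduced in this section, is the identity $u_1 u_2 = 0.25(u_1+u_2)^2 - 0.25(u_1-u_2)^2$ that is visible in (25); the function name is illustrative only.

```python
from itertools import product


def bilinear_via_squares(u1, u2):
    """Express u1*u2 as a difference of squares of two linear functions."""
    return 0.25 * (u1 + u2) ** 2 - 0.25 * (u1 - u2) ** 2


# Compare against the plain product on a grid covering the relaxed domain [0, 1]^2.
grid = [i / 4 for i in range(5)]
max_err = max(abs(bilinear_via_squares(a, b) - a * b) for a, b in product(grid, grid))
```

The two squared arguments, $u_1+u_2$ and $u_1-u_2$, are exactly the quantities that the procedure later uses as branching functions.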

From Section 3.3.5, the arguments in the nonlinear functions in the STF are the branching functions. Consequently, the four functions $u_1 + u_2$, $u_1 - u_2$, $u_3 + u_4$ and $u_3 - u_4$ are the branching functions. The partitioning of $D$ into subregions will be accomplished with the help of the branching functions. Thus, the $i$-th subregion $D_i$ will have the form

$$D_i = \{u : a_1^i \le u_1 + u_2 \le b_1^i,\ a_2^i \le u_1 - u_2 \le b_2^i,\ a_3^i \le u_3 + u_4 \le b_3^i,\ a_4^i \le u_3 - u_4 \le b_4^i,\ 0 \le u_j \le 1\} \tag{27}$$

where the bounds $a_j^i$, $b_j^i$ ($j = 1,\dots,4$) are determined by the BB procedure. If there is no initial partition of $D$ then

$$a_1^1 = a_3^1 = 0, \quad b_1^1 = b_3^1 = 2, \quad a_2^1 = a_4^1 = -1, \quad b_2^1 = b_4^1 = 1 \tag{28}$$

In the problem, constraints (22) and (23) are nonlinear and nonconvex since they contain the bilinear terms (i.e. linear with respect to each variable) $u_1 u_2$ and $u_3 u_4$, respectively. In order to find the globally optimal solution of problem (24), the BB method solves an approximate (i.e. relaxed) version of problem (24) in which each $u_i$ is allowed to take any value (including fractional values) between 0 and 1.

Step 7 of Section 3.4

Thus, BB solves

$$\min_{x,u} f(u,x) \quad \text{s.t.} \quad \varphi_i(u,x) \le 0,\ i = 0,1,2, \quad u \in D = \{u : 0 \le u \le 1\}, \quad x^L \le x \le x^U \tag{29}$$

Thus for the $i$-th subregion, (29) becomes

$$f_i = \min_{x,\ u \in D_i} f(u,x) \quad \text{s.t.} \quad \varphi_j(u,x) \le 0,\ j = 0,1,2, \quad x^L \le x \le x^U \tag{30}$$

The BB method obtains a solution indirectly by generating a series of lower and upper bounds. In order to obtain a lower bound of $f_i$ we must solve the following problem (designated as $P_i^L$ in Section 3.3.2)

$$\min_{x,\ u \in D_i} f(u,x) \quad \text{s.t.} \quad \varphi_0(u,x) \le 0, \quad L[\varphi_j(u,x); D_i] \le 0,\ j = 1,2, \quad x^L \le x \le x^U \tag{31}$$

Each $L[\varphi_j(u,x); D_i]$, $j = 1,2$, is of the form

$$L[\varphi_1; D_i] = a_{11} u_1 + \dots + a_{14} u_4 + 0.25\,a_{15}\,L[(u_1 + u_2)^2; S_{1,2}^i] + 0.25\,a_{15}\,L[-(u_1 - u_2)^2; \bar{S}_{1,2}^i] + a_{16} x_1 + c_1$$

$$L[\varphi_2; D_i] = a_{21} u_1 + \dots + a_{24} u_4 + 0.25\,a_{25}\,L[(u_3 + u_4)^2; S_{3,4}^i] + 0.25\,a_{25}\,L[-(u_3 - u_4)^2; \bar{S}_{3,4}^i] + a_{26} x_1 + c_2$$

where

$$S_{1,2}^i = \{a_1^i \le u_1 + u_2 \le b_1^i\}, \quad \bar{S}_{1,2}^i = \{a_2^i \le u_1 - u_2 \le b_2^i\},$$

$$S_{3,4}^i = \{a_3^i \le u_3 + u_4 \le b_3^i\}, \quad \bar{S}_{3,4}^i = \{a_4^i \le u_3 - u_4 \le b_4^i\}$$

Using (15) and (16) we obtain

$$L[(u_1 + u_2)^2; S_{1,2}^i] = (a_1^i + b_1^i)\,[u_1 + u_2 - 0.5(a_1^i + b_1^i)] + 0.25(a_1^i + b_1^i)^2$$

$$L[-(u_1 - u_2)^2; \bar{S}_{1,2}^i] = -(a_2^i)^2 - (a_2^i + b_2^i)(u_1 - u_2 - a_2^i)$$

$$L[(u_3 + u_4)^2; S_{3,4}^i] = (a_3^i + b_3^i)\,[u_3 + u_4 - 0.5(a_3^i + b_3^i)] + 0.25(a_3^i + b_3^i)^2$$

$$L[-(u_3 - u_4)^2; \bar{S}_{3,4}^i] = -(a_4^i)^2 - (a_4^i + b_4^i)(u_3 - u_4 - a_4^i)$$

In order to obtain an upper bound of $f_i$ in the $i$-th subregion we must solve problem $P_i^U$ (see Section 3.3.3), which in this case is of the form

$$\min_{x,\ u \in D_i} f(u,x) \quad \text{s.t.} \quad \varphi_0(u,x) \le 0, \quad U[\varphi_j(u,x); D_i] \le 0,\ j = 1,2, \quad x^L \le x \le x^U \tag{32}$$

Each $U[\varphi_j(u,x); D_i]$, $j = 1,2$, is of the form

$$U[\varphi_1; D_i] = a_{11} u_1 + \dots + a_{14} u_4 + 0.25\,a_{15}\,U[(u_1 + u_2)^2; S_{1,2}^i] + 0.25\,a_{15}\,U[-(u_1 - u_2)^2; \bar{S}_{1,2}^i] + a_{16} x_1 + c_1$$

$$U[\varphi_2; D_i] = a_{21} u_1 + \dots + a_{24} u_4 + 0.25\,a_{25}\,U[(u_3 + u_4)^2; S_{3,4}^i] + 0.25\,a_{25}\,U[-(u_3 - u_4)^2; \bar{S}_{3,4}^i] + a_{26} x_1 + c_2$$

where

$$U[(u_1 + u_2)^2; S_{1,2}^i] = (a_1^i)^2 + (a_1^i + b_1^i)(u_1 + u_2 - a_1^i)$$

$$U[-(u_1 - u_2)^2; \bar{S}_{1,2}^i] = -(a_2^i + b_2^i)\,[u_1 - u_2 - 0.5(a_2^i + b_2^i)] - 0.25(a_2^i + b_2^i)^2$$

$$U[(u_3 + u_4)^2; S_{3,4}^i] = (a_3^i)^2 + (a_3^i + b_3^i)(u_3 + u_4 - a_3^i)$$

$$U[-(u_3 - u_4)^2; \bar{S}_{3,4}^i] = -(a_4^i + b_4^i)\,[u_3 - u_4 - 0.5(a_4^i + b_4^i)] - 0.25(a_4^i + b_4^i)^2$$
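The linear estimators can be tried out on the single bilinear term $u_1 u_2$. The following is a minimal sketch, assuming the estimators are the midpoint tangent (for a convex square) and the secant over the interval (for a concave one), and using the initial bounds from (28) as defaults; all function names are hypothetical.

```python
def under_square(z, a, b):
    """Tangent to z**2 at the midpoint of [a, b]: a linear underestimator."""
    return (a + b) * (z - 0.5 * (a + b)) + 0.25 * (a + b) ** 2


def over_square(z, a, b):
    """Secant through (a, a**2) and (b, b**2): a linear overestimator on [a, b]."""
    return a ** 2 + (a + b) * (z - a)


def under_neg_square(z, a, b):
    """Linear underestimator of -z**2 on [a, b] (negated secant)."""
    return -over_square(z, a, b)


def over_neg_square(z, a, b):
    """Linear overestimator of -z**2 on [a, b] (negated tangent)."""
    return -under_square(z, a, b)


def bilinear_bounds(u1, u2, s_bounds=(0.0, 2.0), d_bounds=(-1.0, 1.0)):
    """Linear lower and upper estimates of u1*u2 built from the sum and difference."""
    s, d = u1 + u2, u1 - u2
    lower = 0.25 * under_square(s, *s_bounds) + 0.25 * under_neg_square(d, *d_bounds)
    upper = 0.25 * over_square(s, *s_bounds) + 0.25 * over_neg_square(d, *d_bounds)
    return lower, upper
```

For a positive coefficient such as $a_{15} > 0$ the two estimates scale directly; for a negative coefficient the roles of the lower and upper estimates would be exchanged. Tightening `s_bounds` and `d_bounds`, as branching does, brings the two estimates closer together.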

The lower and upper bounds (calculated as described above) will be used in the BB method (Section 3.3.1) for solving the CAMD model in (24). For this simple example, consider the first iteration of the BB method (Section 3.3.1) as follows.

Step 1. Set $k = 1$. Give an initial set $L^{(0)}$ of the subregions $D_i$ ($i = 1,\dots,N_0$; often $N_0 = 1$). Let $N_0 = 1$. This means there is only one subregion $D_1$, which coincides with $D$.

Step 2. Calculate the lower bound (LB) for each subregion. Since there is only one subregion, problem (31) is solved for the case when the values of $a_j^1$, $b_j^1$ ($j = 1,\dots,4$) are given by (28). Let $u^*$, $x^*$ be the solution of the problem and $f_1^*$ be the optimal value of the objective function. Then $\mu_1 = f_1^*$.

Step 3. Determine an "optimal" region with the smallest LB. Let it be the $l_m$-th region; then

$$\mu_{l_m} = \min_{l \in L^{(k)}} \mu_l$$

Now there is only one subregion; therefore $l_m = 1$.

Step 4. Split the subregion $D_{l_m}$ into two subregions $D_p$ and $D_q$ ($D_{l_m} = D_p \cup D_q$). Suppose that we start branching with the help of the branching function $(u_1 + u_2)$. Then $D_p$ and $D_q$ differ from $D_{l_m}$ only in their bounds on $u_1 + u_2$.

Step 5. Determine the LB and UB (upper bound) for $D_p$ and $D_q$. We must solve problems (31) and (32) for both subregions.

Step 6. Determine the smallest UB $\eta^{(k)}$ at the $k$-th iteration:

$$\eta^{(k)} = \min(\eta^{(k-1)}, \eta_p, \eta_q)$$

Let $\eta_p > \eta_q$. Then $\eta^{(k)} = \eta_q$.

Step 7. If $\eta^{(k)} - \mu_l < \varepsilon$ then STOP with the solution.

Step 8. If $\mu_j > \eta^{(k)}$ for $j = p$ or $j = q$ then the corresponding region is removed. Suppose $\mu_j < \eta^{(k)}$, $j = p, q$.

Step 9. Form the new set $L^{(k)}$ of subregions:

$$L^{(1)} = (D_p, D_q)$$

Step 10. Set $k = k + 1$, and go to Step 3 for the next iteration.

Note that the algorithm stops in Step 7; the values in the vector of variables $u$ are used to determine which structural groups make up the molecule. For example, if $u_1 = 1$, then structural group 1 is present in the molecule; otherwise it is absent. With this simple example we showed how underestimators are constructed (Step 6 of Section 3.4) and described one iteration of the BB procedure (Step 7 of Section 3.4).
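The bounding, branching and pruning loop of Steps 1 to 10 can be sketched on a one-dimensional toy problem. Everything in the block below is an illustrative assumption rather than the chapter's example: the objective $g(z) = z^4 - 3z^2 + z$, the bisection rule for splitting, and the use of interval midpoints as trial points for the upper bound. The lower bound on each interval comes from a tangent to the convex part and a secant to the concave part, in the spirit of the estimators above, so that no LP solver is needed.

```python
def g(z):
    """Toy nonconvex objective (an assumption, not taken from the text)."""
    return z ** 4 - 3.0 * z ** 2 + z


def lower_bound(a, b):
    """Minimum over [a, b] of a linear underestimator of g."""
    m = 0.5 * (a + b)

    def under(z):
        tangent = m ** 4 + 4.0 * m ** 3 * (z - m)     # tangent to convex z**4 at midpoint
        secant = -3.0 * (a ** 2 + (a + b) * (z - a))  # secant underestimates -3*z**2
        return tangent + secant + z

    return min(under(a), under(b))  # a linear function is minimal at an endpoint


def branch_and_bound(a, b, eps=1e-6, max_iter=10000):
    regions = [(lower_bound(a, b), a, b)]             # Step 1: one initial subregion
    eta, z_best = min((g(z), z) for z in (a, b))      # incumbent upper bound
    for _ in range(max_iter):
        if not regions:
            break
        regions.sort()                                # Step 3: smallest lower bound first
        mu, lo, hi = regions.pop(0)
        if eta - mu < eps:                            # Step 7: bounds have met
            break
        mid = 0.5 * (lo + hi)                         # Step 4: split the chosen subregion
        if g(mid) < eta:                              # Step 6: update the smallest UB
            eta, z_best = g(mid), mid
        for sub_lo, sub_hi in ((lo, mid), (mid, hi)):
            mu_sub = lower_bound(sub_lo, sub_hi)      # Step 5: LB of each child
            if mu_sub <= eta:                         # Step 8: drop regions with LB > UB
                regions.append((mu_sub, sub_lo, sub_hi))
    return z_best, eta


z_star, f_star = branch_and_bound(-2.0, 2.0)
```

In the CAMD setting the lower and upper bounds would instead come from solving problems (31) and (32), and the split would act on the bounds of a branching function such as $u_1 + u_2$.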


Adjiman, C. S., Dallwig, S., Floudas, C. A., and Neumaier, A. (1998). A global optimization method, alpha-BB, for general twice-differentiable NLPs - I. Theoretical advances. Computers and Chemical Engineering, 22(9), 1137-1158.

Archer, W. L. (1996). Industrial Solvent Handbook, Marcel Dekker Inc.

Barton, A. F. (1985). CRC Handbook of Solubility Parameters and Other Cohesion Parameters, CRC Press, Inc., Boca Raton, Florida.

Brooke, A. (1996). GAMS - A User's Guide, Scientific Press, San Francisco, CA.

Churi, N., and Achenie, L. E. K. (1996). Novel Mathematical Programming Model for Computer Aided Molecular Design. Industrial and Engineering Chemistry Research, 35(10), 3788-3794.

Constantinou, L., and Gani, R. (1994). New Group Contribution Method for Estimating Properties of Pure Compounds. AIChE Journal, 40, 1697- 1710.

Duvedi, A. P., and Achenie, L. E. K. (1996). Designing Environmentally Safe Refrigerants Using Mathematical Programming. Chemical Engineering Science, 51, 3727-3739.

Friedler, F., Fan, L. T., Kalotai, L., and Dallos, A. (1998). A combinatorial approach for generating candidate compounds with desired properties based on group contribution. Computers and Chemical Engineering, 22(6), 809-817.

Hansen, C. M., and Beerbower, A. (1971). Solubility Parameters. Kirk- Othmer Encyclopedia of Chemical Technology, A. Standen, ed., Interscience, New York.

Horst, R., and Tuy, H. (1990). Global Optimization: Deterministic Approaches, Springer-Verlag, Heidelberg.

Lyman, W. J., Reehl, W. F., and Rosenblatt, D. H. (1981). Handbook of Chemical Property Estimation Methods, McGraw-Hill Book Company.

Maranas, C. D. (1997). Optimal Molecular Design under Property Prediction Uncertainty. AIChE Journal, 43(5), 1250-1263.

McCormick, G. P. (1976). Computability of global solutions to factorable nonconvex programs. Part I -- convex underestimating problems. Math. Program., 10, 147-175.


Moore, R. E. (1966). Interval Analysis, Prentice-Hall, Englewood Cliffs, New Jersey.

Odele, O., and Machietto, S. (1993). Computer Aided Molecular Design: A Novel Method for Optimal Solvent Selection. Fluid Phase Equilibria, 82, 47-54.

Ostrovsky, G. M., Ostrovsky, M. G., and Mikhailow, G. W. (1990). Discrete Optimization of chemical processes. Computers and Chemical Engineering, 14(1), 111.

Ostrovsky, G., Achenie, L. E. K., and Sinha, M. "A Reduced Dimension Branch-and-Bound Algorithm for Molecular Design," (to appear in Journal of Global Optimization, circa 2000)

Pantelides, (1996). Global Optimization of General Process Models. In I.E. Grossmann , ed. Global Optimization in Engineering Design, Kluwer Academic Publishers.

Pistikopoulos, E. N., and Stefanis, S. K. (1998). Optimal solvent design for environmental impact minimization. Computers and Chemical Engineering, 22(6), 717-733.

Quesada, I., and Grossmann, I. E. (1995). A Global Optimization Algorithm for Linear Fractional and Bilinear Programs. Journal of Global Optimization, 6, 39-76.

Ryoo, H. S., and Sahinidis, N. V. (1996). A Branch-and-Reduce Approach to Global Optimization. Journal of Global Optimization, 8, 107-138.

Sherali, H. D., and Alameddine, A. (1992). A new reformulation- linearization technique for bilinear programming problems. Journal of Global Optimization, 2, 379-410.

Sinha, M. A Systems Engineering Framework for Solvent Design. Ph.D. Thesis, University of Connecticut, 1999.

Sinha, M., Achenie, L. E. K. and Ostrovsky, G. M. "Design of Environmentally Benign Solvents via Global Optimization," Comp. Chem Eng. 23, 1381-1394, 1999.

Tamiz, M. (1996). Multi-Objective Programming and Goal Programming Theories and Applications, Springer, York.

Vaidyanathan, R., and El-Halwagi, M. (1994). Computer-Aided Design of High Performance Polymers. J. Elastom Plasti., 26(3), 277.

Venkatasubramanian, V., and Chan, K. (1989). A neural network methodology for process fault diagnosis. AIChE Journal, 35, 1993.

Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.

Chapter 4: Optimization Methods in CAMD - II

A. Apostolakou & C. S. Adjiman


Computer-aided molecular design (CAMD) is a synthesis activity, which aims to identify a list of candidate molecules that perform a set of tasks most effectively. Its application to a specific problem should always be followed by a verification stage, which relies on experimental data. The availability of property prediction techniques that describe broad classes of compounds is a central issue in CAMD. The accuracy of the techniques used must be sufficient to enable the final candidate list to be meaningful. Several strategies can be followed to deal with the uncertainty inherent in property prediction: some of the property requirements can be relaxed (e.g., Duvedi and Achenie, 1996) or uncertainty can be accounted for explicitly in the formulation of the CAMD optimization problem (Maranas, 1997; Dua and Pistikopoulos, 1998). In many cases, however, it is necessary to develop more reliable property prediction techniques. The advent of connectivity-based group contribution methods is a particularly promising development in this area.

Group contribution methods, based on the principles of transferability and additivity, are widely used in CAMD. In principle, only the mass of a chemical compound is exactly equal to the sum of the masses of its constituents. However, there are many properties which are approximately additive, provided the building blocks are appropriately chosen. Among the potential building blocks (atoms, bonds or groups), the bond and group additivity methods have received most attention. When the property under study depends on the shape of the molecule and on intermolecular forces, the additivity rule becomes less reliable. For example, by substituting fluorine and/or chlorine into a hydrocarbon molecule, polarity effects lead to a failure of the simple additivity principle. The change caused by fluorine cannot be attributed to the group alone but also depends on the environment in which the substituent is placed. Consequently, both structural and environmental or proximity effects must be accounted for in property prediction. In view of these issues and of the inability of simple group contribution techniques to distinguish between isomers, there has been a drive to develop property prediction techniques which use the connectivity of molecules as a basis. For pure component properties such as critical constants, for instance, several approaches have been proposed: Needham et al. (1988) have used Kier's shape indices (Kier and Hall, 1976), Constantinou et al. (1993) have developed a technique based on the concept of conjugate forms, Constantinou and Gani (1994) and Marrero and Gani (2001) have presented a "high-order" group contribution method, and Marrero-Morejón and Pardillo-Fontdevila (1999) have proposed the use of group-interaction contributions. Other recent connectivity-based methods are discussed in Poling et al. (2000).

Early CAMD methodologies were unable to take advantage of the availability of connectivity-based structure-property relationships. The number of atom groups of different types in the candidate molecule has been used as the key decision variable in "generate-and-test" methods (Gani and Brignole, 1983; Brignole et al., 1986; Joback and Stephanopoulos, 1989; Gani et al., 1991), in mixed-integer optimization-based approaches (Macchietto et al., 1990; Odele and Macchietto, 1993; Duvedi and Achenie, 1996; Maranas, 1996; Pistikopoulos and Stefanis, 1998; Buxton et al., 1999), and in stochastic-based optimization methods (Venkatasubramanian et al., 1995; Marcoulaki and Kokossis, 1998, 2000a,b; Ourique and Telles, 1998). With the advent of connectivity-based prediction methods, several researchers have developed new strategies for embedding this information within the CAMD methodology. Constantinou et al. (1996) have proposed a systematic strategy for generating isomers from a set of groups. Harper et al. (1999) have used this capability to integrate additional property prediction techniques based on molecular modeling within their CAMD framework. Churi and Achenie (1996) have developed a mixed integer formulation for mathematical programming approaches on the basis of the graph-theoretic representation of molecules. The integer decision variables are the number of groups of a given type in the molecule, binary variables denoting whether a bonding site j on a vertex i is bonded to another vertex p, and binary variables denoting whether a group of type k is at vertex i. Raman and Maranas (1998) have also used graph theory to derive a convex MINLP formulation for use with topological indices. In this case, the binary decision variables denote whether two vertices in the molecular graph are connected. Camarda and Maranas (1999) extended the formulation to identify the specific groups in the molecule. A similar formulation was recently used for the design of pharmaceutical products by Siddhaye et al. (2000).

In their high-order group contribution approach for pure component properties, Marrero and Gani (2001) proposed (i) to enhance the group contribution approach with a larger set of functional groups that allows a more detailed representation of chemical structures, and (ii) to use large data sets to estimate the contributions of these groups. This method has led to significant improvements in accuracy and applicability of group contribution techniques. As a result, we seek to develop a formulation of the optimization-based molecular design problem which makes use of this new property prediction technique and accounts for the full connectivity of the molecule. We have chosen to use an optimization-based approach because it enables an implicit search of the space of solutions, which is extremely large due to the combinatorial nature of the problem (Joback and Stephanopoulos, 1989; Maranas, 1996). Thus, a large number of molecular structures can be eliminated without fully evaluating their performance. This characteristic is especially valuable when the property estimation techniques and/or performance criteria used require expensive computations, or when the simultaneous design of material and process (e.g., Buxton et al., 1999) is addressed.

In the next section, we present the problem definition, highlighting the features of the group contribution method used, and we propose the formulation of a general mixed-integer nonlinear program (MINLP) accounting for connectivity. The formulation is applicable to molecules containing arbitrary numbers of rings, aromatic or otherwise. It also allows the distinction between isomers of aromatic compounds. We note that some molecules can be multiply defined in terms of the groups used by Marrero and Gani (2001) and that some rules must be applied to allow the unique identification of suitable molecular descriptions. As a result, in Section 4.3, we develop a systematic strategy for ensuring that molecules are correctly represented according to the Marrero and Gani rules. In Section 4.4, we apply the proposed approach to the design of aromatic compounds and in Chapter 12, to a simple refrigerant design problem.

4.2 PROBLEM DEFINITION

4.2.1 General problem formulation

Most CAMD problems can be stated as "given a desired range for a set of properties and performance criteria, design the compound that performs best, while possessing properties within the acceptable range" (Vaidyanathan and El-Halwagi, 1996). In order to write the general formulation for a CAMD problem, we introduce the following variables:

• $\pi$ is the vector of properties of the compound,
• $y$ is the vector of integer variables that determine the molecular structure,
• $x$ is the vector of relevant process variables, if applicable.

A typical CAMD problem may take the form

$$\begin{aligned} \min_{y,x} \quad & F(\pi(y,x)) \\ \text{s.t.} \quad & \pi^L \le \pi(y,x) \le \pi^U \\ & g(y) \le 0 \\ & h(y) = 0 \\ & y \in \{0,1\}^q, \quad x \in \mathbb{R}^n \end{aligned} \tag{1}$$

where $\pi^U$ and $\pi^L$ are upper and lower bounds on the property values, $F$ is the performance criterion to be optimized, and $g$ and $h$ are vectors of inequality and equality constraints generally associated with structural feasibility requirements as well as preferences imposed by the designer.

4.2.2 Description of group contribution method

The group contribution method recently proposed by Marrero and Gani (2001) allows the estimation of nine important physical properties of pure organic compounds (normal boiling point, critical temperature, critical pressure, critical volume, standard enthalpy of formation, standard enthalpy of vaporization, standard Gibbs energy, normal melting point, standard enthalpy of fusion). One of the distinguishing features of the method is its accuracy for a varied and large set of compounds. Parameter tables have been developed from regression using a data set of about 2000 compounds with 3 to 60 carbons, including large, polyfunctional and complex heterocyclic compounds. The properties of a compound are calculated from the contributions of three types of groups: first order groups, second order groups and third order groups. The first order groups are intended to describe a wide variety of organic compounds and are larger and more numerous than groups in the commonly used method of Joback (1987). First order groups allow some level of distinction between isomers. The role of the second and third order groups is to provide further structural information about molecular fragments of compounds in order to distinguish between more isomers and to account for proximity effects arising from polyfunctionality. Thus, the estimation is performed at three levels. The overall property estimation model has the following form

$$f(\pi) = \sum_{k \in G_1} n_{1k} C_k + c_2 \sum_{k \in G_2} n_{2k} D_k + c_3 \sum_{k \in G_3} n_{3k} E_k \tag{2}$$

where $C_k$ is the contribution of the first order group of type $k$ that occurs $n_{1k}$ times in the molecule, $D_k$ is the contribution of the second order group of type $k$ that occurs $n_{2k}$ times and $E_k$ is the contribution of the third order group of type $k$ that occurs $n_{3k}$ times. $G_1$, $G_2$ and $G_3$ are the sets of first, second and third order groups, respectively. $c_2$ and $c_3$ are weights equal to 1 or 0 which allow second- and third-order corrections to be turned on or off, respectively. The left-hand side of (2) is a simple function $f(\pi)$ of the target property $\pi$ as listed in Marrero and Gani (2001).
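The three-level sum in (2) is straightforward to evaluate once the group counts are known. The sketch below computes only the right-hand side; the contribution values are made-up placeholders, not the published Marrero and Gani parameters, and the function and argument names are hypothetical. Converting the result to a property value would further require the property-specific function $f$ from the original paper.

```python
def group_contribution_rhs(n1, C, n2=None, D=None, n3=None, E=None, c2=0, c3=0):
    """Right-hand side of the three-level model: first, second and third order sums.

    n1, n2, n3 map group names to occurrence counts; C, D, E map group names
    to contributions; c2 and c3 (0 or 1) switch the higher-order corrections.
    """
    first = sum(count * C[k] for k, count in n1.items())
    second = sum(count * D[k] for k, count in (n2 or {}).items()) if c2 else 0.0
    third = sum(count * E[k] for k, count in (n3 or {}).items()) if c3 else 0.0
    return first + c2 * second + c3 * third


# Placeholder first order contributions (NOT real parameter values).
C_demo = {"CH3": 1.0, "CH": 0.5, "OH": 2.0}

# 2-propanol described by first order groups only: 2 x CH3, 1 x CH, 1 x OH.
rhs_first_order = group_contribution_rhs({"CH3": 2, "CH": 1, "OH": 1}, C_demo)
```

With `c2 = c3 = 0` the call reduces to the first order prediction, which is the level used in the remainder of the chapter.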

In this work, we develop a formulation of the general CAMD problem (1) which allows the use of this more versatile group contribution method. We focus exclusively on first order predictions as they already allow the representation of a wide variety of chemical classes including simple aromatics and cyclic compounds (see Table 1 for a list of first order groups). The rules which must be applied when deciding which first-order groups make up a given molecule (Marrero and Gani, 2001) are:

Rule 1. The molecule must be described entirely by first-order groups.

Rule 2. There must be no overlap between first-order groups.

If alternative first order representations of a molecule are possible:

Rule 3. In general, the heaviest first order groups are used. Thus, while CH3CH2COO can in principle be represented as (CH3, CH2, COO) or (CH3, CH2COO), the latter description should always be used because CH2COO is heavier than CH2 or COO.

Rule 4. For an aromatic substituent, an aC-R group should be used instead of an aC group.

Rule 5. For amides and ureas, the amide and urea groups should be used.
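Rule 3 can be illustrated with a small comparison of candidate descriptions. This is only a sketch under stated assumptions: group masses are computed from rounded standard atomic masses, and "heaviest groups first" is interpreted here as comparing the group masses of each description in descending order, which is one plausible reading rather than the published procedure. The dictionaries and function names are hypothetical.

```python
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}  # rounded standard values

# Atom counts of the groups involved in the CH3CH2COO example.
GROUP_ATOMS = {
    "CH3": {"C": 1, "H": 3},
    "CH2": {"C": 1, "H": 2},
    "COO": {"C": 1, "O": 2},
    "CH2COO": {"C": 2, "H": 2, "O": 2},
}


def group_mass(group):
    """Mass of a group from its atom counts."""
    return sum(ATOMIC_MASS[atom] * n for atom, n in GROUP_ATOMS[group].items())


def preferred_description(candidates):
    """Pick the description whose groups, sorted heaviest first, compare largest."""
    return max(candidates, key=lambda groups: sorted(map(group_mass, groups), reverse=True))


chosen = preferred_description([("CH3", "CH2", "COO"), ("CH3", "CH2COO")])
```

Here `chosen` is the description containing CH2COO, in agreement with the example given under Rule 3.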

4.2.3 Molecular representation

To formulate the molecular design problem using the groups in the method of Marrero and Gani (2001), the number of each first order group in the compound must be determined. This requires the definition of a set of basic groups and the specification of the connectivity of the molecule. To develop a mathematical framework for this problem, a graph representation of molecules has been adopted. In particular, a molecule is represented by a graph where basic groups and their bonds correspond to graph vertices and edges, respectively (Horvath, 1992; Mavrovouniotis, 1996). The vertex adjacency matrix or any other matrix used for representing a graph can be used to completely determine the molecular graph. In general, it suffices to describe each basic group by a number and a valency (number of bonds formed).

First order groups (FOGs) are prime candidates to be used as basic groups. However, a number of the first order groups proposed by Marrero and Gani (2001) have two different atoms with free bonds: CH2CO can be connected to another group via the CH2 carbon or the CO carbon. In this case, information on the type and number of FOGs occurring in a molecule, and on the connectivity between these groups or vertices, is not always sufficient to unambiguously determine the molecular structure. For instance, the set of groups (CH3, CH3, CH2, CH2CO) with vertex adjacency matrix

           CH3   CH3   CH2   CH2CO
  CH3       0     0     1     0
  CH3       0     0     0     1
  CH2       1     0     0     1
  CH2CO     0     1     1     0

describes both diethyl ketone (CH3CH2COCH2CH3) and 2-pentanone (CH3CH2CH2COCH3). This can be attributed to the asymmetry of the CH2CO group. However, 2-pentanone could also be constructed from the groups (CH2, CH2, CH3, CH3CO). According to rule 3, this second set of groups should be preferred since CH3CO is heavier than CH2CO. Thus the bond CH3-CH2CO is allowed if CH3 and CH2 are bonded, but forbidden if CH3 and CO are bonded. To use first order groups as basic groups, we must thus be able to provide a unique identification of the connectivity of the molecule.

This issue is addressed by assigning to each first order group a valency for each bond type. For instance, group CH2CO has two bond types, a 'CH2' bond and a 'CO' bond, with a valency of 1 for each bond type. Its overall valency is 2. In order to keep the number of parameters and variables to a minimum, the bond types are labeled 'a', 'b', 'c'. Only three bond types are then needed to describe all the first-order groups proposed by Marrero and Gani (2001). The assignment of group type and valency information $v_{k,t}$ for group $k$ and bond type $t$ is listed in Table 1. The vertex adjacency matrix for diethyl ketone is then

             CH3   CH3   CH2   CH2CO a   CH2CO b
  CH3         0     0     1      0         0
  CH3         0     0     0      1         0
  CH2         1     0     0      0         1
  CH2CO a     0     1     0      0         0
  CH2CO b     0     0     1      0         0

and that for 2-pentanone,

             CH3   CH3   CH2   CH2CO a   CH2CO b
  CH3         0     0     1      0         0
  CH3         0     0     0      0         1
  CH2         1     0     0      1         0
  CH2CO a     0     0     1      0         0
  CH2CO b     0     1     0      0         0

The matrix for 2-pentanone does not belong to the set of allowable vertex adjacency matrices. The different bond types are also useful when dealing with aromatic and cyclic compounds. By convention, aromatic bonds in aromatic groups are systematically assigned to type 'a'. Types 'b' and 'c' are then used for bonds which connect the aromatic group to non-aromatic bonds. For instance group 21, aC-CH2, has an aromatic (aC) bond of valency 2 assigned to type 'a' and a non-aromatic (CH2) bond of valency 1 assigned to type 'b'. For all aromatics except group 16, the valency of the aromatic 'a' bond is 2 since each aromatic carbon in a ring must be bonded to two other aromatic carbons. For group 16, which is used for fused aromatics, $v_{16,a} = 3$. Similarly, cyclic bonds in cyclic compounds are systematically assigned to types 'a' and 'b'. The bond type 'b' is needed since some cyclic groups such as CH=C are asymmetric and require two cyclic bond types. Bonds on cyclic groups that can be made with noncyclic groups are assigned to type 'c'. All cyclic groups have exactly two cyclic bonds, so that $v_{k,a} + v_{k,b} = 2$ for any cyclic group $k$. The first order groups, their bond types and valencies are listed in Table 1.
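The bond-typed adjacency matrices lend themselves to simple consistency checks. The sketch below is an illustration only: it encodes the two matrices of the CH2CO example, checks symmetry and the per-bond-type valencies, and flags the CH3 to CO bond that rule 3 forbids for CH2CO. The data layout and all function names are assumptions made for this example.

```python
# Row/column labels as (group, bond type) pairs, in the order used in the text.
LABELS = [("CH3", "a"), ("CH3", "a"), ("CH2", "a"), ("CH2CO", "a"), ("CH2CO", "b")]

# Valency v[k, t] of each (group, bond type) pair appearing in this example.
VALENCY = {("CH3", "a"): 1, ("CH2", "a"): 2, ("CH2CO", "a"): 1, ("CH2CO", "b"): 1}

DIETHYL_KETONE = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]

PENTANONE_2 = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
]


def is_symmetric(A):
    """An undirected molecular graph must have a symmetric adjacency matrix."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


def valencies_satisfied(labels, A):
    """Each row must use exactly the valency of its (group, bond type) pair."""
    return all(sum(row) == VALENCY[label] for label, row in zip(labels, A))


def has_forbidden_ch3_co_bond(labels, A):
    """True if a CH3 vertex is bonded to the 'b' (CO) bond of CH2CO."""
    return any(
        A[i][j] and labels[i] == ("CH3", "a") and labels[j] == ("CH2CO", "b")
        for i in range(len(labels))
        for j in range(len(labels))
    )
```

Both matrices pass the symmetry and valency checks; only the 2-pentanone matrix is flagged by `has_forbidden_ch3_co_bond`, which is why it is excluded from the allowable set.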

Table 1: First order groups and their bonds

  No.  Group           Bond type a (1)       Bond type b (2)     Bond type c       Class (3)
                       Group        v_k,a    Group     v_k,b     Group    v_k,c
  1    CH3             CH3          1        -         0         -        0        S
  2    CH2             CH2          2        -         0         -        0        S
  3    CH              CH           3        -         0         -        0        S
  4    C               C            4        -         0         -        0        S
  5    CH2=CH          CH2=CH       1        -         0         -        0        S
  6    CH=CH           CH=CH        2        -         0         -        0        S
  7    CH2=C           CH2=C        2        -         0         -        0        S
  8    CH=C            CH=          1        C=        2         -        0        S
  9    C=C             C=C          4        -         0         -        0        S
  10   CH2=C=CH        CH2=C=CH     1        -         0         -        0        S
  11   CH2=C=C         CH2=C=C      2        -         0         -        0        S
  12   CH=C=CH         CH=C=CH      2        -         0         -        0        S
  13   CH≡C            CH≡C         1        -         0         -        0        S
  14   C≡C             C≡C          2        -         0         -        0        S
  15   aCH             aCH          2        -         0         -        0        A
  16   aC (4)          aC           3        -         0         -        0        A
  17   aC (5)          aC           2        aC        1         -        0        A
  18   aC (6)          aC           2        aC        1         -        0        A
  19   aN              aN           2        -         0         -        0        A
  20   aC-CH3          aC-CH3       2        -         0         -        0        A
  21   aC-CH2          aC           2        CH2       1         -        0        A
  22   aC-CH           aC           2        CH        2         -        0        A
  23   aC-C            aC           2        C         3         -        0        A
  24   aC-CH=CH2       aC-CH=CH2    2        -         0         -        0        A
  25   aC-CH=CH        aC           2        CH=CH     1         -        0        A
  26   aC-C=CH2        aC           2        C=CH2     1         -        0        A
  27   aC-C≡CH         aC-C≡CH      2        -         0         -        0        A
  28   aC-C≡C          aC           2        C≡C       1         -        0        A
  29   OH              OH           1        -         0         -        0        S
  30   aC-OH           aC-OH        2        -         0         -        0        A

(1) For aromatics, the free bonds must be linked to other aromatic C or N. For cyclics, they must be linked to other cyclic atoms.
(2) For cyclics, the free bonds must be linked to other cyclic atoms.
(3) See Section 4.3. A: aromatic group; UA: urea or amide group; UAS: urea or amide subgroup; S: standard group.
(4) Fused with aromatic ring.
(5) Fused with non-aromatic subring. This group belongs to the set $G_{1a}$.
(6) Except as groups 16 and 17.


.Table I (continued) 31 COOH 32 aC-COOH 33 CH3CO 34 CH2CO 35 CHCO 36 CCO 37 aC-CO 38 CHO 39 aC-CHO 40 CH3CO0 41 CH2CO0 42 CHCO0 43 CCOO 44 HCOO 45 aC-CO0 46 aC-OOCH 47 aC-OOC 48 COO 49 CH30 50 CH20 51 CH-O 52 C-O 53 aC-O 54 CH2NH2 55 CHNH2 56 CNH2 57 CH3NH 58 CH2NH 59 CHNH 60 CH3N 61 CH2N 62 aC-NH2 63 aC-NH 64 aC-N 65 NH2 66 CH=N 67 C=N 68 CH2CN 69 CHCN 7O CCN 71 aC-CN 72 CN 73 CH2NCO 74 CHNCO 75 CNCO 76 aC-NCO

COOH aC-COOH

CH3CO CO CO CO AC

CHO aC-CHO CH3COO

COO COO COO

HCO0 AC

aC-OOCH AC CO

CH30 0 0 0

AC CH2NH2 CHNH2 CNH2

CH3NH NH NH

CH3N N

aC-NH2 AC AC NH2 CH= C=

CH2CN CHCN CCN

aC-CN CN

CH2NCO CHNCO CNCO

aC-NCO

CH2 CH C

CO

CH2 CH C

COO

OOC 0

CH2 CH C 0

CH2 CH

CH2

NH N

N= N=

S A

UAS UAS UAS UAS

A S A S S S S S A A A S S S S S A S S S

UAS UAS UAS UAS UAS

A A A

UAS S S S S S A S S S S A


Table 1 (continued) 77 CH2NO2 78 CHNO2 79 CNO2 80 aC-NO2 81 NO2 82 ONO 83 ONO2 84 HCON(CH2)2 85 HCONHCH2 86 CONH2 87 CONHCH3 88 CONHCH2 89 CON(CH3)2 90 CONCH3CH2 91 CON(CH2)2 92 CONHCO 93 CONCO 94 aC-CONH2 95 aC-NH(CO)H 96 aC-N(CO)H 97 aC-CONH 98 aC-NHCO 99 aC-NCO 100 NHCONH 101 NH2CONH 102 NH2CON 103 NHCON 104 NCON 105 aC-NHCONH2

106 aC-NHCONH 107 NHCO 108 CH2C1 109 CHC1 110 CC1 111 CHC12 112 CC12 113 CC13 114 CH2F 115 CHF 116 CF 117 CHF2 118 CF2 119 CF3 120 CC12F 121 CHC1F

CH2NO2 1 CHNO2 2 CNO2 3

aC-NO2 2 NO2 1 ONO 1 ONO2 1

HCON(CH2)2 2 HCONHCH2 1

CONH2 1 CONHCH3 1

CONH 1 CON(CH3)2 1

CONCH3 1 CON 1

CONHCO 2 CO 2

aC-CONH2 2 AC-NH(CO)H 2

AC 2 AC 2 AC 2 AC 2

NHCONH 2 NH2CONH 1 NH2CON 2

NHCO 1 NCON 4

AC- 2 NHCONH2

AC 2 NH 1

CH2C1 1 CHC1 2 CC1 3

CHC12 1 CC12 2 CCI~ 1 CH2F 1 CHF 2 CF 3

CHF2 1 CF2 2 CF3 1

CC12F 1 CHC1F 1

CH2

CH2 CH2

N

N(CO)H CONH NHCO

N

N

NHCONH CO

CO

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

S S S A S S S

UA UA UA UA UA UA UA UA UA UA A A A A A A

UA UA UA UA UA A

A UA S S S S S S S S S S S S S S


Table 1 (continued) 122 CC1F2 123 aC-C1 124 aC-F 125 aC-I 126 aC-Br 127 I 128 Br- 129 F 130 CI 131 CHNOH 132 CNOH 133 aC-CHNOH 134 OCH2CH2OH 135 OCHCH2OH

136 OCH2CHOH 137 -O-OH 138 CH2SH 139 CHSH 140 CSH 141 aC-SH 142 -SH 143 CH3S 144 CH2S 145 CHS 146 CS 147 aC-S- 148 SO 149 SO2 150 SO3 (sulfite) 151 SO3 (sulfonate) 152 SOn 153 aC-SO 154 aC-S02 155 PH 156 P 157 PO3 (phosphite) 158 PHO3 159 PO3

(phosphonate) 160 PHO4 161 PO4 162 aC-PO4 163 aC-P 164 CO3 165 C2H30

CC1F2 1 aC-C1 2 aC-F 2 aC-I 2

aC-Br 2 I 1

Br 1 F 1 C1 1 - -

CH 2 NOH C 3 NOH

AC 2 CH OCH2CH2OH 1

O 1 CHCH. 2OH

OCH2 1 CHOH OOH 1

CH2SH 1 CHSH 2 CSH 3

aC-SH 2 SH 1 - -

CH3S 1 - -

S 1 CH2 S 1 CH S 1 C

AC 2 S SO 2 SO2 2 - -

SO3 2 - -

SO 1 O 304 2 - -

AC 2 SO AC 2 S02 PH 2 - -

P 3 P03 3 - -

PHO3 2 - -

PO 1 O

PH04 3 - -

PO4 3 - -

AC 2 PO4 AC 2 P CO3 2 - -

C2H30 1 - -

NOH

S A A A A S S S S S S A S S

S S S S S A S S S S S A S S S S S A A S S S S S

S S A A S S


Table 1 (continued)

  No.  Group           Bond type a           Bond type b          Bond type c       Class
                       Group        v_k,a    Group      v_k,b     Group    v_k,c
  166  C2H2O           C2H2O        2        -          0         -        0        S
  167  C2O             C2O          4        -          0         -        0        S
  168  CH2 (cyc)       CH2(cyc)     2        -          0         -        0        S
  169  CH (cyc)        CH(cyc)      2        -          0         CH       1        S
  170  C (cyc)         C(cyc)       2        -          0         C        2        S
  171  CH=CH (cyc)     CH=CH(cyc)   2        -          0         -        0        S
  172  CH=C (cyc)      CH=(cyc)     1        C=(cyc)    1         C=       1        S
  173  C=C (cyc)       C=(cyc)      2        -          0         C=       2        S
  174  CH2=C (cyc)     CH2=C(cyc)   2        -          0         -        0        S
  175  NH (cyc)        NH(cyc)      2        -          0         -        0        S
  176  N (cyc)         N(cyc)       2        -          0         N        1        S
  177  CH=N (cyc)      CH=(cyc)     1        N=(cyc)    1         -        0        S
  178  C=N (cyc)       C=(cyc)      1        N=(cyc)    1         C        1        S
  179  O (cyc)         O(cyc)       2        -          0         -        0        S
  180  CO (cyc)        CO(cyc)      2        -          0         -        0        S
  181  S (cyc)         S(cyc)       2        -          0         -        0        S
  182  SO2 (cyc)                                                                    S

A graph representing a single molecule has the following properties:

1. It is a connected single graph, that is, every pair of vertices is joined by a path.

2. It has no loops, that is, no edges joining a vertex to itself.

3. It is a labeled graph.

(1,CH3,a) (2,CH,a) (3,CH3,a)

(4,OH,a)

Figure 1: Graphical representation of 2-propanol.

Throughout this work, a vertex in a molecular graph is distinguished from the other vertices in the graph by a unique number, the enumeration number, and is also labeled after the basic group which occurs at this particular vertex. For example, in Fig. 1, the graph representation of 2-propanol is shown as a tree with 4 vertices (1,CH3,a), (2,CH,a), (3,CH3,a) and (4,OH,a), and 3 edges ((1,CH3,a), (2,CH,a)), ((2,CH,a), (3,CH3,a)) and ((2,CH,a), (4,OH,a)). The set of edges completely determines the molecular graph of the compound. Since the elements of the vertex adjacency matrix of the molecular graph define the set of its edges, the matrix gives complete information on the molecule's make-up and connectivity.
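The graph properties listed above can be checked directly on the 2-propanol example. The following is a minimal sketch, not part of the original formulation: the helper names `adjacency_matrix`, `is_connected` and `has_loops` are hypothetical, and the bond type label is carried along only for completeness.

```python
from collections import deque

# Vertices of the 2-propanol graph: (enumeration number, basic group, bond type)
vertices = [(1, "CH3", "a"), (2, "CH", "a"), (3, "CH3", "a"), (4, "OH", "a")]
# Edges listed in the text, written with enumeration numbers only
edges = [(1, 2), (2, 3), (2, 4)]

def adjacency_matrix(n, edges):
    """Build the symmetric 0-1 vertex adjacency matrix for vertices 1..n."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i - 1][j - 1] = 1
        a[j - 1][i - 1] = 1
    return a

def is_connected(a):
    """Breadth-first search from vertex 1; connected if every vertex is reached."""
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(len(a)):
            if a[i][j] and j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen) == len(a)

def has_loops(a):
    """A loop is an edge joining a vertex to itself (non-zero diagonal entry)."""
    return any(a[i][i] for i in range(len(a)))

A = adjacency_matrix(len(vertices), edges)
print(A)                 # [[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0]]
print(is_connected(A))   # True
print(has_loops(A))      # False
```

The adjacency matrix alone recovers the three edges, which is the sense in which it carries the full connectivity of the molecule.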


We now introduce variables for the structural description of a molecule, and then derive structural feasibility constraints in terms of these variables to ensure that the molecules are physically meaningful and that they are described according to the rules of the high-order group contribution method of Marrero and Gani (2001).

4.2.4 Definition of Structural Variables

Variables representing the existence of a bond between each pair of basic groups are used as key structural variables. In terms of the graphical representation, these variables are the edges of the graph. We first define the following sets and parameters:

$G_1$ = { k | k is a basic first order group }
$G_{1n}$ = { k | k is a first order group with non-cyclic, non-aromatic bonds only }
$G_{1c}$ = { k | k is a first order group with cyclic bonds }
$G_{1a}$ = { k | k is a first order group with aromatic bonds }
$T$ = { t | bond type } = {a, b, c}
$V$ = { i | enumeration number of a vertex in molecular graph }

$N_{1,\max}$ = maximum number of basic groups in final compound.

Several binary variable representations can be adopted. Churi and Achenie (1996), Raman and Maranas (1998) and Camarda and Maranas (1999) used a binary variable $z_{i,t,j}$, denoting whether vertex i is linked via a type-t bond to vertex j

\[
z_{i,t,j} =
\begin{cases}
1, & \text{if vertex } i \text{ is linked via bond type } t \text{ to vertex } j, \\
0, & \text{otherwise.}
\end{cases}
\]

In addition, they defined a variable $u_{i,k}$ to describe the existence of group k at vertex i

\[
u_{i,k} =
\begin{cases}
1, & \text{if group } k \text{ exists at vertex } i, \\
0, & \text{otherwise,}
\end{cases}
\]

where $k \in G_1$ and $i \in V$. While this notation is very compact in terms of the number of variables, it leads to a large number of constraints. For instance, to prevent a bond between CH2CO^b and CH3^a, as discussed in Section 4.2.3, $|V|^2$ constraints must be imposed, where $|V|$ is the number of vertices in the set V. The constraints are of the form

\[
u_{i,\mathrm{CH_2CO}} + u_{j,\mathrm{CH_3}} + z_{i,b,j} \le 2, \quad \forall i, j \in V. \tag{3}
\]

In this work, we develop an alternative formulation, which involves a larger number of variables, but fewer constraints. In many cases, such formulations can eliminate much of the symmetry of the problem and result in problems that can be solved more easily (Barnhart et al., 1998). We use the variable $u_{i,k}$ defined previously, and we model the existence of an edge between bond type t of group k at vertex i and bond type tt of group kk at vertex j via the binary variable $y_{(i,k,t),(j,kk,tt)}$

\[
y_{(i,k,t),(j,kk,tt)} =
\begin{cases}
1, & \text{if bond type } t \text{ of group } k \text{ at vertex } i \text{ is connected to bond type } tt \text{ of group } kk \text{ at vertex } j, \\
0, & \text{otherwise,}
\end{cases}
\]

where $k, kk \in G_1$, $t, tt \in T$ and $i, j \in V$. Variable y defines the vertex adjacency matrix for the molecule. In this case, a bond between CH2CO^b and CH3^a is prevented by imposing a single constraint

\[
\sum_{i \in V} \sum_{j \in V} y_{(i,\mathrm{CH_2CO},b),(j,\mathrm{CH_3},a)} = 0. \tag{4}
\]
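The trade-off between the two representations can be made concrete by counting indices. The sketch below is illustrative only: it counts every index combination before any symmetry reduction or solver presolve, the function names are hypothetical, and the sizes (20 vertices, 15 groups, 2 bond types) are those of the application example in Section 4.4.

```python
def forbidden_pair_constraint_counts(n_vertices):
    """Constraints needed to forbid one bond between two given groups."""
    z_form = n_vertices ** 2  # one constraint of type (3) for every pair (i, j)
    y_form = 1                # a single summed constraint of type (4)
    return z_form, y_form

def variable_counts(n_vertices, n_groups, n_bond_types):
    """Number of binary variables of each kind, counting all index combinations."""
    u = n_vertices * n_groups
    z = n_vertices * n_bond_types * n_vertices
    y = (n_vertices * n_groups * n_bond_types) ** 2
    return {"u": u, "z": z, "y": y}

print(forbidden_pair_constraint_counts(20))  # (400, 1)
print(variable_counts(20, 15, 2))            # {'u': 300, 'z': 800, 'y': 360000}
```

The y-based formulation therefore carries far more binary variables, while each forbidden bond costs one constraint instead of one per vertex pair.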

The number of occurrences of group k in the molecule is modeled through the introduction of the integer variable $n1_k$. To simplify the notation, we will use the following symbols:

\[
\sum_{k \in G_1} = \sum_{k}, \quad
\sum_{kk \in G_1} = \sum_{kk}, \quad
\sum_{i \in V} = \sum_{i}, \quad
\sum_{j \in V} = \sum_{j}, \quad
\sum_{t \in T} = \sum_{t}, \quad
\sum_{tt \in T} = \sum_{tt}.
\]

Furthermore, multiple summations will be expressed as

\[
\sum_{i} \sum_{j} = \sum_{i,j}.
\]

The number of edges q in the molecular graph is expressed as

\[
q = \frac{1}{2} \sum_{i,j,k,kk,t,tt} y_{(i,k,t),(j,kk,tt)}. \tag{5}
\]

The factor of 1/2 takes into account the fact that each edge is counted twice.

The number of vertices $p_v$ is given by

\[
p_v = \sum_{k \in G_1} \sum_{i} u_{i,k} = \sum_{k \in G_1} n1_k. \tag{6}
\]

The sum of the valencies of all the vertices in the graph is given by

\[
\sum_{k \in G_1} \sum_{t,i} \nu_{k,t}\, u_{i,k}. \tag{7}
\]

In order to calculate the number of occurrences $n1_k$ of group k in the structure, we set

\[
n1_k = \sum_{i} u_{i,k}, \quad \forall k \in G_1. \tag{8}
\]
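As an illustration of these counting relations, the sketch below evaluates the number of edges, the number of vertices, the valency sum and the group occurrences for 2-propanol. It is a minimal example under stated assumptions: u and y are stored as a Python dictionary and a set of ordered pairs (only the entries equal to 1 are kept), and the valencies of CH3, CH and OH are taken as 1, 3 and 1 for bond type 'a'.

```python
# 2-propanol: u[(i, k)] = 1 if group k is present at vertex i
u = {(1, "CH3"): 1, (2, "CH"): 1, (3, "CH3"): 1, (4, "OH"): 1}
# Valency of bond type t in group k; only type 'a' bonds occur in this molecule
valency = {("CH3", "a"): 1, ("CH", "a"): 3, ("OH", "a"): 1}
bond_types = ["a", "b", "c"]

# y holds the ordered pairs ((i, k, t), (j, kk, tt)) for which the variable is 1;
# each edge is stored in both directions, as in a symmetric adjacency matrix.
group_at = {i: k for (i, k), present in u.items() if present}
y = set()
for i, j in [(1, 2), (2, 3), (2, 4)]:
    end_i, end_j = (i, group_at[i], "a"), (j, group_at[j], "a")
    y.add((end_i, end_j))
    y.add((end_j, end_i))

q = len(y) // 2                          # Eq. (5): each edge is counted twice
n1 = {}
for (i, k), present in u.items():        # Eq. (8): occurrences of each group
    n1[k] = n1.get(k, 0) + present
p_v = sum(n1.values())                   # number of vertices
valency_sum = sum(valency.get((k, t), 0) * present   # expression (7)
                  for (i, k), present in u.items() for t in bond_types)

print(q, p_v, valency_sum, n1)  # 3 4 6 {'CH3': 2, 'CH': 1, 'OH': 1}
```

The valency sum equals twice the number of edges, which is the usual consistency check for a graph in which every bond is used.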

4.2.5 Structural Feasibility Constraints

The main feasibility constraints stem from the fact that the representation of a molecule must correspond to a connected, single graph with no loops. The first constraint ensures that each bond type on each group at each vertex is connected to the appropriate number of other groups

\[
\nu_{k,t}\, u_{i,k} = \sum_{j,kk,tt} y_{(i,k,t),(j,kk,tt)}, \quad \forall k \in G_1, \forall t \in T, \forall i \in V. \tag{9}
\]

The following constraint ensures that the graph obtained is connected (Churi and Achenie, 1996). It implies that group k can occur at vertex i only if this vertex is connected to at least one of the vertices 1 to i-1.

\[
\sum_{kk,t,tt} \sum_{j=1}^{i-1} y_{(i,k,t),(j,kk,tt)} \ge u_{i,k}, \quad \forall k \in G_1, \forall i \in V \setminus \{1\}. \tag{10}
\]

To prevent the existence of loops, through a vertex connected to itself or two vertices connected together through two single bonds, we specify

\[
\sum_{k,kk,t,tt} y_{(i,k,t),(i,kk,tt)} = 0, \quad \forall i \in V, \tag{11}
\]

\[
\sum_{k,kk,t,tt} y_{(i,k,t),(j,kk,tt)} \le 1, \quad \forall i, j \in V. \tag{12}
\]

The symmetry of the vertex adjacency matrix is enforced by

\[
y_{(i,k,t),(j,kk,tt)} = y_{(j,kk,tt),(i,k,t)}, \quad \forall k, kk \in G_1, \forall t, tt \in T, \forall i, j \in V. \tag{13}
\]

The requirement that at most one group can occur at each vertex is expressed as

\[
\sum_{k \in G_1} u_{i,k} \le 1, \quad \forall i \in V. \tag{14}
\]
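The constraints above can be read as a checklist for a candidate structure. The following sketch tests constraints (9) and (11) to (14) on a given assignment; the connectivity constraint (10) is left out for brevity. The function `violated_constraints` and the data layout (a dictionary for u, a set of ordered pairs for the y entries equal to 1) are assumptions made for this illustration.

```python
def violated_constraints(u, y, valency, vertices, bond_types):
    """Return the sorted labels of constraints (9), (11)-(14) that are violated."""
    violated = set()
    groups = {k for (_, k) in u}
    # (14): at most one group at each vertex
    if any(sum(u.get((i, k), 0) for k in groups) > 1 for i in vertices):
        violated.add("(14)")
    # (13): the adjacency matrix must be symmetric
    if any((b, a) not in y for (a, b) in y):
        violated.add("(13)")
    # (11): no vertex is bonded to itself
    if any(a[0] == b[0] for (a, b) in y):
        violated.add("(11)")
    # (12): at most one bond between any ordered pair of vertices
    pair_counts = {}
    for (a, b) in y:
        pair_counts[(a[0], b[0])] = pair_counts.get((a[0], b[0]), 0) + 1
    if any(count > 1 for count in pair_counts.values()):
        violated.add("(12)")
    # (9): every bond of every group present must be used exactly once
    for i in vertices:
        for k in groups:
            for t in bond_types:
                used = sum(1 for (a, _) in y if a == (i, k, t))
                if used != valency.get((k, t), 0) * u.get((i, k), 0):
                    violated.add("(9)")
    return sorted(violated)

# 2-propanol, with each edge stored in both directions
u = {(1, "CH3"): 1, (2, "CH"): 1, (3, "CH3"): 1, (4, "OH"): 1}
valency = {("CH3", "a"): 1, ("CH", "a"): 3, ("OH", "a"): 1}
ends = {1: (1, "CH3", "a"), 2: (2, "CH", "a"), 3: (3, "CH3", "a"), 4: (4, "OH", "a")}
y = set()
for i, j in [(1, 2), (2, 3), (2, 4)]:
    y |= {(ends[i], ends[j]), (ends[j], ends[i])}

print(violated_constraints(u, y, valency, [1, 2, 3, 4], ["a", "b", "c"]))  # []
# Removing the C-OH bond leaves free valencies on CH and OH
y_broken = y - {(ends[2], ends[4]), (ends[4], ends[2])}
print(violated_constraints(u, y_broken, valency, [1, 2, 3, 4], ["a", "b", "c"]))  # ['(9)']
```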


Additional constraints are used to describe the allowable connections between groups, in particular for cyclic and aromatic groups. The number of rings in the molecule, $R_{tot}$, is determined by

\[
\frac{1}{2} \sum_{k,kk,t,tt,i,j} y_{(i,k,t),(j,kk,tt)} - \sum_{k \in G_1} \sum_{i} u_{i,k} = -1 + R_{tot}. \tag{15}
\]
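Constraint (15) states that the number of edges minus the number of vertices equals the number of rings minus one, which holds for a connected graph. A minimal numerical check is sketched below; the edge lists and the vertex numbering of naphthalene are assumptions chosen for this illustration.

```python
def ring_count(edges, n_vertices):
    """R_tot from constraint (15): q - p_v = -1 + R_tot, for a connected graph."""
    return len(edges) - n_vertices + 1

# 2-propanol: a tree with 4 vertices and 3 edges
propanol = [(1, 2), (2, 3), (2, 4)]
# Naphthalene: 10 aromatic vertices; vertices 5 and 10 are shared by both rings
naphthalene = [(i, i + 1) for i in range(1, 10)] + [(10, 1), (5, 10)]

print(ring_count(propanol, 4))      # 0
print(ring_count(naphthalene, 10))  # 2
```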

We note that the set of non-cyclic groups is $G_{1n} \cup G_{1a} \setminus \{17\}$. Group 17 is unique in that it is both aromatic and cyclic and is included in the set of aromatic groups only. It has only one cyclic bond, which is assigned to bond type 'b'. Cyclic bonds in cyclic groups must only be connected to cyclic bonds in other cyclic groups. This is expressed by preventing a cyclic group k at vertex i from being bonded to a non-cyclic group kk at vertex j through a bond type 'a' or 'b'

\[
\sum_{k \in G_{1c}} \; \sum_{kk \in G_{1n} \cup G_{1a} \setminus \{17\}} \; \sum_{i,j,t} \left( y_{(i,k,a),(j,kk,t)} + y_{(i,k,b),(j,kk,t)} \right) = 0, \tag{16}
\]

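and by preventing bond type 'a' or 'b' in a cyclic group from being bonded to bond type 'c' in another cyclic group and to bond type 'a' in group 17

\[
\sum_{k \in G_{1c}} \sum_{kk \in G_{1c}} \sum_{i,j} \left( y_{(i,k,a),(j,kk,c)} + y_{(i,k,b),(j,kk,c)} \right)
+ \sum_{k \in G_{1c}} \sum_{i,j} \left( y_{(i,k,a),(j,17,a)} + y_{(i,k,b),(j,17,a)} \right) = 0. \tag{17}
\]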

Group 17, which possesses a cyclic bond ('b') but is not included in the set of cyclic groups, must also be constrained in a similar way

\[
\sum_{k \in G_{1n} \cup G_{1a}} \sum_{i,j,t} y_{(i,17,b),(j,k,t)} = 0, \tag{18}
\]

\[
\sum_{k \in G_{1c}} \sum_{i,j} y_{(i,17,b),(j,k,c)} = 0. \tag{19}
\]

Similarly, aromatic bonds (type 'a') in aromatic groups must not be linked to non-aromatic groups

\[
\sum_{k \in G_{1a}} \; \sum_{kk \in G_{1n} \cup G_{1c}} \; \sum_{i,j,t} y_{(i,k,a),(j,kk,t)} = 0. \tag{20}
\]

Further, aromatic bonds in aromatic groups must not be linked to non-aromatic bonds 'b' or 'c' in other aromatic groups

\[
\sum_{k \in G_{1a}} \sum_{kk \in G_{1a}} \sum_{i,j} \left( y_{(i,k,a),(j,kk,b)} + y_{(i,k,a),(j,kk,c)} \right) = 0. \tag{21}
\]

Additional constraints must be imposed on aromatic rings to ensure that each such ring contains exactly six aromatic groups. For this purpose, we define the variable $R_{a,tot}$ as the number of aromatic rings. We label the rings by defining two types of binary variables

\[
Ra_w =
\begin{cases}
1, & \text{if aromatic ring } w \text{ exists,} \\
0, & \text{otherwise,}
\end{cases}
\]

and

\[
r_{i,w} =
\begin{cases}
1, & \text{if vertex } i \text{ is in ring } w, \\
0, & \text{otherwise.}
\end{cases}
\]

Possible configurations of aromatic rings are shown in Figure 2.

Figure 2: Possible configurations of aromatic rings, panels (a) to (e).

In order to assign values to the ri, w variables, we note that if there is no aromatic group at vertex i, the vertex cannot be in any ring. If there is an aromatic group other than a fused aromatic (group 16), the vertex must be in exactly one ring (Fig. 2a). If there is a fused aromatic, the vertex must be in two to three rings (Fig. 2b-2e). This is ensured through the following constraints

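\[
\sum_{k \in G_{1a} \setminus \{16\}} u_{i,k} + 2 u_{i,16} \le \sum_{w} r_{i,w} \le \sum_{k \in G_{1a} \setminus \{16\}} u_{i,k} + 3 u_{i,16}, \quad \forall i \in V, \tag{22}
\]

where $w \in W$, the set of aromatic rings. To ensure that each ring has exactly six aromatic groups, we set

\[
\sum_{i} r_{i,w} = 6\, Ra_w, \quad \forall w \in W. \tag{23}
\]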
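Constraints (22) and (23) can be illustrated on naphthalene, where two fused aromatic groups (group 16) are each shared by two six-membered rings. The sketch below is only an illustration: the function `ring_membership_ok`, the vertex numbering and the ring labels are assumptions, and the set of aromatic groups is the one used in the application example of Section 4.4.

```python
def ring_membership_ok(group_at, rings, ring_exists, aromatic, fused=16):
    """Check constraint (22) at every vertex and constraint (23) for every ring."""
    for i, k in group_at.items():
        n_rings = sum(1 for members in rings.values() if i in members)
        if k == fused:
            lo, hi = 2, 3      # fused aromatic: two to three rings
        elif k in aromatic:
            lo, hi = 1, 1      # other aromatic groups: exactly one ring
        else:
            lo, hi = 0, 0      # non-aromatic groups: no ring
        if not (lo <= n_rings <= hi):
            return False
    # (23): an existing ring has exactly six members, a non-existing ring has none
    return all(len(rings[w]) == 6 * ring_exists[w] for w in rings)

aromatic = {15, 16, 18, 20, 21, 22, 23, 30, 32}
# Naphthalene: aCH (group 15) everywhere except the two fused carbons (group 16)
group_at = {i: 15 for i in range(1, 11)}
group_at[5] = group_at[10] = 16
rings = {1: {1, 2, 3, 4, 5, 10}, 2: {5, 6, 7, 8, 9, 10}, 3: set(), 4: set()}
ring_exists = {1: 1, 2: 1, 3: 0, 4: 0}

print(ring_membership_ok(group_at, rings, ring_exists, aromatic))  # True
```

Dropping vertex 5 from the second ring makes the check fail twice over: the fused group is then in a single ring, and that ring has only five members.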

Finally, we need to ensure that each ring consisting of six aromatic groups forms a closed loop. This is enforced by forbidding certain types of bonds as follows. In case (a) in Figure 2, only non-fused aromatics are present. Vertex i in ring $w_1$ must be in the same ring as the vertices it is connected to via its aromatic bonds (type 'a'). Thus, it is in the same ring as vertex $j_1$, but in a different ring from vertex $j_2$. In other words, two non-fused aromatic groups at vertices i and j cannot be connected through their bond type 'a' if the two vertices are in different rings $w_1$ and $w_2$, as imposed by the constraints

\[
\sum_{k \in G_{1a} \setminus \{16\}} \; \sum_{kk \in G_{1a} \setminus \{16\}} y_{(i,k,a),(j,kk,a)} + r_{i,w_1} + r_{j,w_2} \le 2, \quad \forall i, j \in V, \forall w_1, w_2 \in W, w_1 \ne w_2. \tag{24}
\]

In case (b) in Figure 2, the fused aromatic at vertex i is linked to a non-fused aromatic at vertex j. This can only happen if vertex i is shared by two aromatic rings, $w_1$ and $w_2$, and vertex j is in one of these rings. This is expressed mathematically as

\[
\sum_{k \in G_{1a} \setminus \{16\}} y_{(i,16,a),(j,k,a)} + r_{i,w_1} + r_{i,w_2} + r_{j,w_3} \le 3, \quad \forall i, j \in V, \forall w_1, w_2, w_3 \in W, \; w_2 \ne w_1, \; w_3 \ne w_1, w_2. \tag{25}
\]

Cases (c), (d) and (e) of Figure 2 represent instances where two fused aromatic groups are bonded. In case (c), each fused aromatic group belongs to exactly two rings. Vertices i and $j_1$ share one common ring. Vertices i and $j_2$ share two common rings. Thus, two linked fused aromatics belonging to two rings each must belong to at least one common ring. In case (d), the fused aromatic group at i belongs to two rings and that at j belongs to three rings. Vertex i must share two rings with vertex j. Similarly in case (e), the fused aromatic groups at vertices i and j belong to three rings each, and they must share two rings. These conditions can be expressed mathematically through the constraints

\[
y_{(i,16,a),(j,16,a)} + r_{i,w_1} + r_{i,w_2} + r_{i,w_3} + r_{j,w_4} + r_{j,w_5} + r_{j,w_6} \le 2 + \sum_{w \in W} r_{i,w}, \quad \forall i, j \in V,
\]
\[
\forall w_1, w_2, w_3, w_4, w_5, w_6 \in W, \; w_2 \ne w_1; \; w_3 \ne w_1, w_2; \; w_4 \ne w_1, w_2, w_3; \; w_5 \ne w_1, w_2, w_3, w_4; \; w_6 \ne w_1, w_2, w_3, w_4, w_5. \tag{26}
\]

Constraint (26) applies as follows. In cases (c) and (d) in Figure 2, the fused aromatic at vertex i is in two rings so that the right-hand side of (26) is 4. If the fused aromatic groups at i and j are bonded, $y_{(i,16,a),(j,16,a)} = 1$. If the two rings of i are $w_1$ and $w_2$, $r_{i,w_1} + r_{i,w_2} + r_{i,w_3} = 2$. As a result, we must have $r_{j,w_4} + r_{j,w_5} + r_{j,w_6} \le 1$. Thus, only one of the rings of j can be different from the rings of i, regardless of whether j is in two or three rings. In case (e) in Figure 2, the fused aromatic at vertex i is in three rings and the right-hand side of (26) is 5. If the fused aromatic groups at i and j are bonded, $y_{(i,16,a),(j,16,a)} = 1$. If the three rings of i are $w_1$, $w_2$ and $w_3$, $r_{i,w_1} + r_{i,w_2} + r_{i,w_3} = 3$. As a result, we must have $r_{j,w_4} + r_{j,w_5} + r_{j,w_6} \le 1$. In other words, only one of the rings of j can be different from the rings of i, regardless of whether j is in two or three rings.

Finally, a few constraints are added to obtain a tighter formulation. In particular, an aromatic group can be found at a given vertex only if there are aromatic rings in the molecule

\[
u_{i,k} \le \sum_{w \in W} Ra_w, \quad \forall k \in G_{1a}, \forall i \in V. \tag{27}
\]

Similarly, there can only be a cyclic group at a given vertex if there are non-aromatic cycles in the molecule

\[
u_{i,k} \le R_{tot} - \sum_{w} Ra_w, \quad \forall k \in G_{1c}, \forall i \in V. \tag{28}
\]

Further, the number of aromatic rings cannot be greater than the number of rings

\[
\sum_{w \in W} Ra_w - R_{tot} \le 0. \tag{29}
\]

The issue of redundant numbering of aromatic rings is partially avoided by imposing ordering of the rings, that is

\[
Ra_w - Ra_{w-1} \le 0, \quad \forall w \in W \setminus \{1\}. \tag{30}
\]

We now consider additional constraints to make sure that Rules 1 to 5 (Section 4.2.2) for the assignment of first order groups are satisfied. Rules 1 and 2 are met de facto because of the choice of first order groups as basic groups. Rules 3 to 5 require the identification of forbidden matches in the set of first order groups. For instance, the combination of groups 1a and 48a (CH3^a and COO^a) violates rule 3 and the heavier group 40 (CH3COO) should be used. This can be enforced by including the constraint

\[
\sum_{i,j} y_{(i,1,a),(j,48,a)} = 0. \tag{31}
\]

A methodology to identify forbidden matches is described in Section 4.3.

Finally, bound constraints on the number of groups present in the structure are used. Since a minimum of two groups are required to form a molecule, we set

\[
\sum_{k \in G_1} n1_k \ge 2. \tag{32}
\]

The upper bound on the total number of first-order groups is set as a designer specification through the following constraint

\[
\sum_{k \in G_1} n1_k \le N_{1,\max}. \tag{33}
\]

The set of constraints (8)-(30), (32) and (33) along with forbidden matches of type (31) provides the structural feasibility constraints for the design of a compound based on the set of first-order groups proposed by Marrero and Gani (2001). One drawback of this formulation is the lack of a constraint that directs the numbering of the vertices in the molecular graph. As a result, a molecule can be represented in a number of alternative ways, which differ from each other only in the vertex numbering. A partial solution to this problem is achieved by specifying consistent numbering (Churi and Achenie, 1996)

\[
\sum_{k \in G_1} u_{i,k} - \sum_{k \in G_1} u_{i-1,k} \le 0, \quad \forall i \in V \setminus \{1\}. \tag{34}
\]
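The effect of constraint (34) is that occupied vertices must form a block starting at vertex 1. The sketch below checks this on two occupancy patterns and counts how many choices of occupied vertices are removed; it is an illustration only, with a hypothetical function name and an arbitrary example of 4 groups placed on 6 available vertices.

```python
from math import comb

def satisfies_consistent_numbering(occupied):
    """occupied[i] is the sum over k of u at vertex i+1 (0 or 1).

    Constraint (34) requires each vertex to be occupied only if the
    previous vertex is occupied as well.
    """
    return all(occupied[i] <= occupied[i - 1] for i in range(1, len(occupied)))

print(satisfies_consistent_numbering([1, 1, 1, 1, 0, 0]))  # True
print(satisfies_consistent_numbering([1, 0, 1, 1, 1, 0]))  # False

# Without (34), 4 groups can occupy any 4 of the 6 vertices;
# with (34), only vertices 1 to 4 can be occupied.
print(comb(6, 4))  # 15
```

The solution is only partial because the groups can still be permuted among the occupied vertices, subject to the connectivity constraint (10).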

The unambiguous description of molecules through their full connectivity allows the use of group contribution methods based on different sets of groups within the same optimization problem. For instance, the Chueh and Swanson (1973) method for the prediction of molar liquid heat capacities is based on a set of groups which differs from the first order groups of Marrero and Gani (2001). In addition, their methodology requires some information on connectivity: Cl groups have a different contribution depending on the total number of Cl groups bonded to a single carbon atom. Constraints that enable the use of the Marrero and Gani (2001) and Chueh and Swanson (1973) methods simultaneously can be formulated using standard methods of logic modeling with 0-1 variables. This is demonstrated on the case study presented in Section 4.4. The proposed formulation also allows the distinction between para, ortho and meta isomers of aromatic compounds.


4.2.6 Problem Type and Solution

The overall molecular problem as formulated belongs to the class of mixed-integer problems. All the binary variables participate linearly in the problem. Depending on the form of the property expressions used (Eq. (2)), the optimization problem is a Mixed-Integer Linear or Nonlinear Program (MILP or MINLP). It can be solved using standard methods for MILPs (Nemhauser and Wolsey, 1988) or MINLPs (Floudas, 1995; Grossmann, 1996). In the case of nonconvex MINLPs, global optimization algorithms can also be used (e.g. Adjiman et al., 2000). Commercially available software to solve such problems includes GAMS/CPLEX for MILPs and GAMS/DICOPT for MINLPs. Branch-and-price algorithms that can solve problems with large numbers of binary variables efficiently are currently an area of active research and are likely to have a significant impact on our ability to solve larger molecular design problems (Barnhart et al., 1998).

4.3 IDENTIFICATION OF FORBIDDEN BONDS BETWEEN GROUPS

Due to the presence of a large number of first order groups in the set proposed by Marrero and Gani (2001), some molecules can be multiply described. Rules 3 to 5 enable the identification of correct descriptions and give rise to forbidden bonds. These are incorporated within the mathematical formulation through logical constraints of type (31). In this section, we present a systematic methodology to identify forbidden bonds and the corresponding constraints. The full list of such constraints is not given here because of its size. We split the first order groups into four distinct classes, as listed in the rightmost column of Table 1.

Class A - Aromatic groups
There are 47 such groups.

Class UA - Ureas and amides
There are 16 such groups. Aromatics are excluded.

Class UAS - Urea/amide subgroups
These are non-aromatic groups that can be paired to give ureas and amides, i.e. groups 33 to 36, groups 57 to 61 and group 65.

Class S - Standard groups
These are the remaining 109 first order groups.

Several cases can be distinguished.

1. One aromatic group and a non-aromatic group are combined.

If the structure formed by these two groups is unique, that is, it cannot be found in the list of first order groups and it cannot be formed by linking different aromatic - non-aromatic groups, the bond is allowed. This is the case of the combination of groups 21b (aC-CH2^b) and 1a (CH3^a) to give aC-CH2-CH3.

If the structure is not unique, then the only allowed combination is the one containing the heaviest aromatic/cyclic group. All other combinations must be forbidden. This is the case, for instance, of the structure aC-COO-CH2, which can be formed by combining groups 18b (aC^b) - a48b (COO^{a,b}) - 2a (CH2^a), or 45b (aC-COO^b) - 2a (CH2^a).

The combination (18b-a48b-2a) violates rule 4 and is therefore disallowed as follows

\[
\sum_{j_1 \in V} \sum_{j_2 \in V} \left( y_{(i,48,a),(j_1,18,b)} + y_{(i,48,b),(j_2,2,a)} \right) \le 1, \quad \forall i \in V. \tag{35}
\]

2. One urea/amide group and a standard group or urea/amide subgroup are combined.

If the structure can only be obtained from this combination of groups, the combination is allowed. This is the case, for instance, of CH3CONH2 obtained from groups 86a (CONH2^a) and 1a (CH3^a).

If the structure can be obtained through another combination involving a urea/amide group, the combination containing the heaviest urea/amide group is allowed. For instance, group 101 (NH2CONH) can also be generated by combining groups 65a (NH2^a) and 107b (NHCO^b). However, this is not allowed as group 101 is heavier than group 107. This is enforced through

\[
\sum_{i,j} y_{(i,65,a),(j,107,b)} = 0. \tag{36}
\]

3. Two urea/amide subgroups are combined.

If the two groups do not form a urea or amide, the bond is allowed. This is the case, for instance, if groups 34a (CH2CO^a) and 58b (CH2NH^b) are bonded to form CH2CO-CH2NH. If the two groups are bonded through a CO-N bond, the bond is forbidden. This is the case, for instance, if groups 34a and 58a are bonded to form CH2CO-NHCH2. The bond is prevented through the constraint

\[
\sum_{i,j} y_{(i,34,a),(j,58,a)} = 0. \tag{37}
\]

4. Two or more standard groups are combined or one urea/amide subgroup and a standard group are combined.

• If the structure can only be obtained from this combination of first order groups, it is unique and thus allowed. This is the case, for instance, of CH3COOH, which can only be built by linking group 1a (CH3^a) and group 31a (COOH^a).

• If the structure is itself a standard first order group or a urea/amide subgroup, the combination is not allowed. This is the case, for instance, of CH2Cl, which can be built from groups 2a (CH2^a) and 130a (Cl^a), but is also found in the first order group list (group 108). The bond is prevented through the constraint

\[
\sum_{i,j} y_{(i,2,a),(j,130,a)} = 0. \tag{38}
\]

Such a structure can also be obtained by combining more than two groups. For instance, the combination of groups 29a (OH^a), 2a (CH2^a) and 50b (CH2O^b) results in the formation of first order group 134 (OCH2CH2OH). The combination of the three groups is thus excluded through the set of constraints

\[
\sum_{j_1 \in V} \sum_{j_2 \in V} \left( y_{(i,2,a),(j_1,29,a)} + y_{(i,2,a),(j_2,50,b)} \right) \le 1, \quad \forall i \in V. \tag{39}
\]

• If the structure can be generated by a different combination of first order groups, only the combination containing the heaviest first order group is allowed. This is the case, for instance, of NO2CH2COO, which can be built from groups 41b (CH2COO^b) and 81a (NO2^a) or from groups 77a (CH2NO2^a) and 48a (COO^a). Since group 77 is the heaviest of the four groups involved, the combination (77a,48a) is favoured and the combination (41b,81a) is excluded through the constraint

\[
\sum_{i,j} y_{(i,41,b),(j,81,a)} = 0. \tag{40}
\]
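The pairwise forbidden bonds identified in this chapter can be collected in a lookup table and used to screen candidate structures. The sketch below lists only the pairs stated explicitly in constraints (31), (36), (37), (38) and (40); the three-group exclusions (35) and (39) are not pairwise and are not covered. The data layout and the function name `violated_bonds` are assumptions for this illustration, not part of the original method.

```python
# Each forbidden bond is an unordered pair of (group number, bond type) ends.
FORBIDDEN_PAIRS = {
    frozenset({(1, "a"), (48, "a")}),    # constraint (31): CH3 with COO
    frozenset({(65, "a"), (107, "b")}),  # constraint (36): NH2 with NHCO
    frozenset({(34, "a"), (58, "a")}),   # constraint (37): CH2CO with CH2NH
    frozenset({(2, "a"), (130, "a")}),   # constraint (38): CH2 with Cl
    frozenset({(41, "b"), (81, "a")}),   # constraint (40): CH2COO with NO2
}

def violated_bonds(bonds):
    """Return the bonds of a candidate structure that appear in the forbidden list."""
    return [bond for bond in bonds if frozenset(bond) in FORBIDDEN_PAIRS]

# A candidate containing a CH2-Cl bond, which should be described by group 108
candidate = [((1, "a"), (2, "a")), ((130, "a"), (2, "a"))]
print(violated_bonds(candidate))  # [((130, 'a'), (2, 'a'))]
```

Storing each pair as a `frozenset` makes the test independent of the order in which the two ends of a bond are written, mirroring the symmetry of the adjacency matrix.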

4.4 APPLICATION EXAMPLE

4.4.1 Problem description

A small example is used to illustrate the application of the proposed formulation. The design of an aromatic compound with up to two rings is considered. The design specifications are the minimization of the standard heat of fusion, given a maximum value for the melting point ($T_{m,\max}$) and a minimum value for the boiling point ($T_{b,\min}$) of the compound. A set of fifteen first order groups is considered, as listed in Table 2. The maximum compound size is $N_{1,\max} = 20$ groups.

Table 2: First order groups (group number) used in application example

CH3 (1)       CH2 (2)      CH (3)      C (4)
aCH (15)      AC (16)      aC (18)     aCCH3 (20)
aCCH2 (21)    aCCH (22)    aCC (23)    OH (29)
aCOH (30)     COOH (31)    aCCOOH (32)

4.4.2 Problem formulation

Sets

In order to formulate the problem, the following sets are defined:

Set of first order groups: $G_1$ = {1, 2, 3, 4, 15, 16, 18, 20, 21, 22, 23, 29, 30, 31, 32}.
Set of non-aromatic first order groups: $G_{1n}$ = {1, 2, 3, 4, 29, 31}.
Set of aromatic first order groups: $G_{1a}$ = {15, 16, 18, 20, 21, 22, 23, 30, 32}.
Set of vertices: V = {1, ..., 20}.
Set of bond types: T = {a, b}.
Set of aromatic rings: W = {1, 2, 3, 4}.

Variables

The binary variables needed are

• $u_{i,k}$, $\forall i \in V$, $\forall k \in G_1$, denoting whether group k is present at vertex i,
• $y_{(i,k,t),(j,kk,tt)}$, $\forall i, j \in V$, $\forall t, tt \in T$, $\forall k, kk \in G_1$, denoting whether group k at vertex i is linked via a type-t bond to a type-tt bond of group kk at vertex j,
• $Ra_w$, $\forall w \in W$, denoting whether aromatic ring w exists,
• $r_{i,w}$, $\forall i \in V$, $\forall w \in W$, denoting whether vertex i belongs to aromatic ring w.

The following (non-negative) continuous variables are defined:

• $T_m$, the melting point of the compound (in K),
• $T_b$, the boiling point of the compound (in K),
• $H_{fus}$, the standard heat of fusion of the compound (in kJ/mol),
• $n1_k$, $\forall k \in G_1$, the number of groups of type k in the compound,
• $R_{tot}$, the number of aromatic rings in the compound.

Although $n1_k$ and $R_{tot}$ are defined as continuous variables, they are both forced to take on integer values via constraints (8) and (15) respectively.


Data

The data needed consist of the valency of each group in $G_1$, and the contributions to melting point, boiling point and heat of fusion for all these groups. These are listed in Table 3. The contributions for each group are taken from Table 6 of Marrero and Gani (2001). Three constants are defined for use in the property prediction equations: $T_{m0}$ = 147.50 K, $T_{b0}$ = 222.543 K and $H_{fus,0}$ = 5.549 kJ/mol.

Table 3: Group data for the application example

Group k   Number   $\nu_{k,a}$   $\nu_{k,b}$   $T_{m,k}$ (K)   $T_{b,k}$ (K)   $H_{fus,k}$ (kJ/mol)
CH3       1        1             0             0.6953          0.8491          1.660
CH2       2        2             0             0.2515          0.7141          2.639
CH        3        3             0             -0.3730         0.2925          0.134
C         4        4             0             0.0256          -0.0671         -1.232
aCH       15       2             0             0.5860          0.8365          -1.037
AC        16       3             0             1.8955          1.7324          0.845
aC        18       2             1             0.9176          1.5468          -0.531
aCCH3     20       2             0             1.0068          1.5653          2.969
aCCH2     21       2             1             0.1065          1.4925          0.948
aCCH      22       2             2             -0.5197         0.8665          -1.037
aCC       23       2             3             -0.1041         0.5229          -0.2856
OH        29       1             0             2.7888          2.5670          4.786
aCOH      30       2             0             5.1473          3.3205          8.427
COOH      31       1             0             7.4042          5.1108          10.692
aCCOOH    32       2             0             12.4296         6.0677          14.649

Objective function

The objective function is

\[
\min H_{fus}. \tag{41}
\]

Constraints

The design specifications are

\[
T_{b,\min} \le T_b, \tag{42}
\]

\[
T_m \le T_{m,\max}. \tag{43}
\]

To obtain a mixed-integer linear problem, these are reformulated as

\[
\exp\!\left( \frac{T_{b,\min}}{T_{b0}} \right) \le T_{be}, \tag{44}
\]

\[
T_{me} \le \exp\!\left( \frac{T_{m,\max}}{T_{m0}} \right). \tag{45}
\]

The property prediction constraints are given by

\[
T_{me} = \sum_{k \in G_1} T_{m,k}\, n1_k, \tag{46}
\]

\[
T_{be} = \sum_{k \in G_1} T_{b,k}\, n1_k, \tag{47}
\]

\[
H_{fus} - H_{fus,0} = \sum_{k \in G_1} H_{fus,k}\, n1_k. \tag{48}
\]
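The step from the design specifications to their linear form can be spelled out as follows. This is a short derivation sketch, assuming that the auxiliary quantities $T_{be}$ and $T_{me}$ stand for $\exp(T_b/T_{b0})$ and $\exp(T_m/T_{m0})$, as is implied by the property prediction constraints; the constants $T_{b0}$ and $T_{m0}$ are positive.

\[
T_{be} = \exp\!\left( \frac{T_b}{T_{b0}} \right)
\;\Longleftrightarrow\;
T_b = T_{b0} \ln T_{be},
\qquad
T_{me} = \exp\!\left( \frac{T_m}{T_{m0}} \right)
\;\Longleftrightarrow\;
T_m = T_{m0} \ln T_{me}.
\]

Because the exponential function is strictly increasing,

\[
T_{b,\min} \le T_b
\;\Longleftrightarrow\;
\exp\!\left( \frac{T_{b,\min}}{T_{b0}} \right) \le T_{be},
\qquad
T_m \le T_{m,\max}
\;\Longleftrightarrow\;
T_{me} \le \exp\!\left( \frac{T_{m,\max}}{T_{m0}} \right).
\]

The left-hand and right-hand constants are fixed numbers once the specifications are chosen, and $T_{be}$ and $T_{me}$ are linear in the group counts $n1_k$, so the reformulated specifications are linear constraints.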

To reflect the fact that no non-aromatic rings are allowed in the molecule, constraint (29) is re-expressed as

\[
R_{tot} = \sum_{w \in W} Ra_w. \tag{49}
\]

Applying the methodology of Section 4.3 to the set of groups in $G_1$, the following set of forbidden bonds is identified: (18b, 1a), (18b, 2a), (18b, 3a), (18b, 4a), (18b, 29a), (18b, 31a). Thus 18b cannot be bonded to any of the non-aromatic groups. This is expressed as

\[
\sum_{k \in G_{1n}} \sum_{i,j,t} y_{(i,18,b),(j,k,t)} = 0. \tag{50}
\]

Constraint (26) is reduced to account for the fact that a maximum of four rings is allowed. It is given by

\[
y_{(i,16,a),(j,16,a)} + r_{i,w_1} + r_{i,w_2} + r_{j,w_3} + r_{j,w_4} \le 2 + \sum_{w \in W} r_{i,w}, \quad \forall i, j \in V,
\]
\[
\forall w_1, w_2, w_3, w_4 \in W, \; w_2 \ne w_1; \; w_3 \ne w_1, w_2; \; w_4 \ne w_1, w_2, w_3. \tag{51}
\]

Constraints (8)-(15), (20)-(25), (27)-(28), (30), (33)-(34), (44)-(51) are included in the formulation. Constraints (16) to (19) are not included because they deal with cyclic compounds.

Results

The problem is solved with $T_{m,\max}$ = 410 K and $T_{b,\min}$ = 500 K. The three runs described in Table 4 were solved using GAMS/CPLEX. The runs are designed to test the formulation by generating different types of aromatic compounds. All runs were successfully completed and the results are presented in Table 5. The problem was also attempted using a formulation based on the variable type $z_{i,t,j}$ instead of the $y_{(i,k,t),(j,kk,tt)}$ variables. However, due to the large number of constraints that must be introduced to describe the restrictions on aromatic groups, convergence was not achieved.

Table 4: Description of runs performed for application example

Run number   Run description
1            Design an aromatic compound
2            Design an aromatic compound with at least two rings
3            Design an aromatic compound with at least two non-fused rings

Table 5: Results for the application example. Numbers in parentheses are experimental values from Afeefy et al. (2001).

Run   Compound name               Tm (K)      Tb (K)          Hfus (kJ/mol)
1     1,3,5-triisopropylbenzene   218 (266)   517 (507-511)   18.2
2     Naphthalene                 315 (353)   516 (491)       22.8
3     Biphenyl                    301 (343)   543 (527)       24.0
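The predicted melting and boiling points in Table 5 can be reproduced from the group data of Table 3. The sketch below is a minimal check rather than the optimization itself: it assumes the relations $T_m = T_{m0} \ln T_{me}$ and $T_b = T_{b0} \ln T_{be}$ implied by constraints (44) to (47), and the decomposition of each compound into first order groups is an assumption made here for illustration.

```python
import math

TM0, TB0 = 147.50, 222.543  # constants quoted with Table 3 (K)

# (Tm,k, Tb,k) contributions from Table 3, keyed by group number
CONTRIB = {
    1: (0.6953, 0.8491),    # CH3
    15: (0.5860, 0.8365),   # aCH
    16: (1.8955, 1.7324),   # AC (fused aromatic)
    18: (0.9176, 1.5468),   # aC
    22: (-0.5197, 0.8665),  # aCCH
}

# Assumed group counts n1_k for the three designed compounds
COMPOUNDS = {
    "1,3,5-triisopropylbenzene": {15: 3, 22: 3, 1: 6},
    "naphthalene": {15: 8, 16: 2},
    "biphenyl": {15: 10, 18: 2},
}

def predict(counts):
    """Return (Tm, Tb) in K from the linear sums Tme and Tbe of (46) and (47)."""
    tme = sum(CONTRIB[k][0] * n for k, n in counts.items())
    tbe = sum(CONTRIB[k][1] * n for k, n in counts.items())
    return TM0 * math.log(tme), TB0 * math.log(tbe)

for name, counts in COMPOUNDS.items():
    tm, tb = predict(counts)
    meets_specs = tm <= 410.0 and tb >= 500.0  # Tm,max = 410 K, Tb,min = 500 K
    print(f"{name}: Tm = {tm:.0f} K, Tb = {tb:.0f} K, specs met: {meets_specs}")
```

Under these assumptions the three compounds come out close to the tabulated predictions and satisfy both design specifications.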

4.5 CONCLUSIONS

Connectivity-based property prediction techniques are becoming increasingly important in the context of computer-aided molecular design tools. In this chapter, we have proposed an MINLP formulation, which enables the use of a group contribution method based on a large set of functional groups. While a large number of binary variables are used to represent the vertex adjacency matrix, the model results in a comparatively small number of constraints, which reduce the symmetry of the problem and the computational effort required. The fact that the functional groups may possess more than one bond type is reflected in the formulation. This is used as a basis for the design of cyclic and aromatic compounds. General constraints are derived for molecules with an arbitrary number of rings, whether cyclic or aromatic. The proposed approach also enables a number of rules to be applied to the designed molecules, either to ensure more accurate property prediction, or to impose design requirements on the type of candidate molecules that are identified. It is also possible to use simultaneously several property prediction methods which are based on different sets of groups and which require connectivity information. By varying the type of compound to be designed, or by introducing integer cuts, the methodology yields a list of candidate molecules ranked on the basis of optimality. The approach was demonstrated on an illustrative example for the design of aromatic compounds.

The use of optimization techniques in computer-aided molecular design enables the implicit evaluation of a large number of alternative structures. This is especially true when the evaluation step is computationally demanding, as is the case when process performance is a selection criterion for the molecule. Furthermore, the ability to carry out an implicit search in the space of feasible compounds and to represent the full connectivity of the molecule is likely to be an important requirement in using the more accurate property prediction techniques that are currently becoming available, in particular through advances in computational chemistry.

4.6 NOMENCLATURE

Greek letters
z              Vector of physical properties for a compound
$\nu_{k,t}$    Valency of bond type t in group k

Roman letters
C2             Weight in property estimation equation
C3             Weight in property estimation equation
Ci             Contribution of first order group i
Di             Contribution of second order group i
Ei             Contribution of third order group i
f              R.H.S. of property prediction equation
F              Performance index
g              Vector of inequality constraints
G1             Set of first order groups
G1n            Set of non-aromatic, non-cyclic first order groups
G1a            Set of aromatic first order groups
G1c            Set of cyclic first order groups
G2             Set of second order groups
G3             Set of third order groups
h              Vector of equality constraints
Hfus           Standard heat of fusion (kJ/mol)
Hfus,0         Constant used to calculate heat of fusion (kJ/mol)
Hfus,k         Contribution of group k to the heat of fusion
n1i            Number of first order groups of type i in compound
n2i            Number of second order groups of type i in compound
n3i            Number of third order groups of type i in compound
N1,max         Maximum number of first order groups in compound
pv             Number of vertices in molecular graph
q              Number of edges in molecular graph
ri,w           Vector of binary variables denoting whether vertex i is in aromatic ring w
Raw            Vector of binary variables denoting whether aromatic ring w exists
Rtot           Variables denoting the total number of rings
T              Set of bond types
Tb             Normal boiling temperature (K)
Tb0            Constant used to calculate boiling point (K)
Tb,k           Contribution of group k to the boiling point
Tb,min         Minimum value of the boiling point (K)
Tm             Normal melting point (K)
Tm0            Constant used to calculate melting point (K)
Tm,k           Contribution of group k to the melting point
Tm,max         Maximum value of the melting point (K)
ui,k           Vector of binary variables denoting whether group k exists at vertex i
V              Set of vertices
W              Set of aromatic rings
x              Vector of continuous process variables
y              Vector of integer variables describing the structure of a compound

Subscripts
a              Bond type
b              Bond type
c              Bond type
i              Vertex number
j, j1, j2      Vertex number
k, kk          Group type
t, tt          Bond type
w, w1, w2, w3  Ring number
w4, w5, w6     Ring number

Superscripts
a              Bond type
b              Bond type
c              Bond type
L              Lower bound
U              Upper bound


[1] C.S. Adjiman, I.P. Androulakis, C.A. Floudas, Global optimization of mixed-integer nonlinear problems, AIChE J., 46 (2000) 1769.

[2] H.Y. Afeefy, J.F. Liebman, and S.E. Stein, Neutral Thermochemical Data in NIST Chemistry WebBook, NIST Standard Reference Database Number 69, Eds. P.J. Linstrom and W.G. Mallard, National Institute of Standards and Technology, Gaithersburg MD, (http://webbook.nist.gov) (2001).

[3] C. Barnhart, E.L. Johnson, G.L. Nemhauser, M.W.P. Savelsbergh, P.H. Vance, Branch-and-price: column generation for solving huge integer programs, Oper. Res., 46 (1998), 316.

[4] E.A. Brignole, S. Bottini and R. Gani, A strategy for the design and selection of solvents for separation processes, Fluid Phase Eq., 29 (1986) 125.

[5] A. Buxton, A.G. Livingston and E.N. Pistikopoulos, Optimal design of solvent blends for environmental impact minimization, AIChE J., 45 (1999) 817.

[6] K.V. Camarda and C.D. Maranas, Optimization in polymer design using connectivity indices, Ind. Eng. Chem. Res., 38 (1999) 1884.

[7] C.F. Chueh and A.C. Swanson, Estimation of liquid heat capacity, Can. J. Chem. Eng., 51 (1973), 596.

[8] N. Churi and L.E.K. Achenie, Novel mathematical programming model for computer aided molecular design, Ind. Eng. Chem. Res., 35 (1996) 3788.

[9] L. Constantinou, K. Bagherpour, R. Gani, J.A. Klein and D.T. Wu, Computer aided product design: Problem formulations, methodology and applications, Comp. Chem. Eng., 20 (1996) 685.

[10] L. Constantinou and R. Gani, New group contribution method for estimating properties of pure compounds, AIChE J., 40 (1994) 1697.

[11] L. Constantinou, S.E. Prickett and M.L. Mavrovouniotis, Estimation of thermodynamic and physical properties of acyclic hydrocarbons using the ABC approach and conjugation operators, Ind. Eng. Chem. Res., 32 (1993), 1734.

[12] V. Dua and E.N. Pistikopoulos, Optimization techniques for process synthesis and material design under uncertainty, Chem. Eng. Res. Des., 76 (1998) 408.

[13] A. Duvedi and L. Achenie, Designing environmentally safe refrigerants using mathematical programming, Chem. Eng. Sci., 15 (1996) 3727.

[14] C.A. Floudas, Nonlinear and mixed-integer optimization: Fundamentals and applications, Oxford University Press, Oxford (1995).

[15] GAMS, Generalized Algebraic Modeling System, www.gams.com.

[16] R. Gani and E.A. Brignole, Molecular design of solvents for liquid extraction based on UNIFAC, Fluid Phase Eq., 13 (1983) 331.

[17] R. Gani, B. Nielsen and A. Fredenslund, A group contribution approach to computer-aided molecular design, AIChE J., 37 (1991) 1318.

[18] I.E. Grossmann, Mixed-integer optimization techniques for algorithmic process synthesis, Advances in Chemical Engineering, 23 (1996), 171.

[19] F. Harary, Graph Theory, Addison-Wesley, Reading, 1969.


[20] P. Harper, R. Gani, P. Kolar and T. Ishikawa, Computer-aided molecular design with combined molecular modeling and group contribution, Fluid Phase Eq., 160 (1999) 337.

[21] A.L. Horvath, Molecular design: Chemical structure generation from the properties of pure organic compounds, Elsevier, Amsterdam (1992).

[22] K.G. Joback and G. Stephanopoulos, Designing molecules possessing desired physical property values, Foundations of Computer Aided Process Design, (1989) 363.

[23] K.G. Joback, Designing molecules possessing desired physical property values, Ph.D. thesis, MIT, Cambridge (1987).

[24] L.B. Kier and L.H. Hall, Molecular connectivity in chemistry and drug research, Academic Press, New York (1976).

[25] S. Macchietto, O. Odele and O. Omatsone, Design of optimal solvents for liquid-liquid extraction and gas absorption processes, Chem. Eng. Res. Des., 68 (1990) 429.

[26] C.D. Maranas, Optimal computer-aided molecular design: A polymer design case study, Ind. Eng. Chem. Res., 35 (1996) 3403.

[27] C.D. Maranas, Optimal molecular design under property prediction uncertainty, AIChE J., 43 (1997) 1250.

[28] E.C. Marcoulaki and A.C. Kokossis, On the development of novel chemicals using a systematic synthesis approach Part I. Optimisation framework, Chem. Eng. Sci., 55 (2000a) 2529.

[29] E.C. Marcoulaki and A.C. Kokossis, On the development of novel chemicals using a systematic synthesis approach Part II. Solvent design, Chem. Eng. Sci., 55 (2000b) 2547.

[30] J. Marrero and R. Gani, Group-contribution based estimation of pure component properties, Fluid Phase Eq., 183-184 (2001) 183.

[31] J. Marrero-Morejón and E. Pardillo-Fontdevila, Estimation of pure compound properties using group-interaction contributions, AIChE J., 45 (1999) 615.

[32] M. Mavrovouniotis, Product and process design with molecular-level knowledge, in First International Conference on Intelligent Systems in Process Engineering, Eds. J.F. Davis, G. Stephanopoulos, V. Venkatasubramanian, AIChE Symp. Ser. 312, 92 (1996) 133.

[33] G.L. Nemhauser and L. Wolsey, Integer and Combinatorial Optimization, Wiley, New York (1988).

[34] D.E. Needham, I.C. Wei and P.G. Seybold, Molecular modeling of the physical properties of alkanes, J. Am. Chem. Soc., 110 (1988) 4186.

[35] O. Odele and S. Macchietto, Computer aided molecular design: A novel method for optimal solvent selection, Fluid Phase Eq., 82 (1993) 47.

[36] E.N. Pistikopoulos and S.K. Stefanis, Optimal solvent design for environmental impact minimization, Comp. Chem. Eng., 22 (1998) 717.

[37] B.E. Poling, J.M. Prausnitz and J.P. O'Connell, The properties of gases and liquids, McGraw-Hill, New York, 5th edition (2000).

[38] S. Raman and C.D. Maranas, Optimization in product design with properties correlated with topological indices, Comp. Chem. Eng., 22 (1998) 747.

[39] N.V. Sahinidis and M. Tawarmalani, Applications of global optimization to process and molecular design, Comp. Chem. Eng., 24 (2000) 2157.

[40] S. Siddhaye, K. Camarda, E. Topp and M. Southard, Design of novel pharmaceutical products via combinatorial optimization, Comp. Chem. Eng., 24 (2000) 701.

[41] R. Vaidyanathan and M. El-Halwagi, Computer-aided synthesis of polymers and blends with target properties, Ind. Eng. Chem. Res., 35 (1996) 627.


Chapter 5: Genetic Algorithms Based CAMD

P. R. Patkar & V. Venkatasubramanian

Designing new molecules possessing desired properties is an important activity in the chemical, material and pharmaceutical industries. Much of this design involves an elaborate and expensive trial-and-error process that is difficult to automate. A CAMD approach using genetic algorithms (GAs) or genetic programming is presented. Unlike traditional search and optimization techniques, genetic algorithms perform a guided stochastic search where improved solutions are achieved by sampling areas of the search space that have a higher probability for good solutions. Moreover, GAs allow for the direct incorporation of higher-level chemical knowledge and reasoning strategies to make the search more efficient. A background of GAs and the implementation of GA-based search are presented followed by a discussion on the theory behind genetic search. Two polymer design case studies are discussed and an evolutionary design framework based on genetic algorithms is presented for the problems. Results from the studies are presented and some general conclusions are offered.

5.1 INTRODUCTION

The very first chapter of the book highlighted the importance of product design. Clearly, the design of new materials possessing desired properties is a very important activity in the chemical, material and pharmaceutical industries. Computer-aided molecular design (CAMD) offers a very attractive alternative to the traditional trial-and-error experimental approach, particularly since the latter often turns out to be highly protracted and expensive. The application areas of material design are diverse and encompass polymers, polymeric composites, blends, paints and varnishes, refrigerants, solvents, drugs, pesticides, and so on. The focus of this book is primarily molecular design, a special case of the broader material design problem. CAMD has been applied successfully to the design of solvents [1,2], refrigerants [3,4], pharmaceutical products [5] and polymers [6-10]. Part II of the book presents some of these applications.

In general, the overall task of CAMD requires the solution of two sub-problems: the forward problem, which involves the computation of some performance measures or physical, chemical and/or biological properties from the product structure and composition; and the inverse problem, which entails the identification of the appropriate molecular structure or composition given the desired macroscopic properties. This is illustrated in Fig. 1. Various methods can be employed for the estimation of properties from the structure. Approaches typically used include those based on group contribution [11-14], topological indices [15-18], molecular modeling [19] or their combination [20]. Depending on the problem, the prediction method could be more general and include highly nonlinear neural networks or other black-box models as well [21].

Fig. 1. Components of the molecular design problem

The solution to the inverse problem, which involves the systematic identification of viable structures, is a non-trivial task. A variety of techniques have been employed for the inverse problem, including knowledge-based systems [22, 23], machine-learning techniques [24], graph reconstruction methods [25, 26] and enumeration-based algorithms [27-29]. A number of rigorous mathematical formulations have also been proposed [1-10] and solved for several design applications. Some of these methods, both for the forward and inverse problems, have been discussed in detail in other chapters of the book.

In general, the desirable features of any inverse solution method are:

• Generality of application,
• Ability to handle nonlinear objective functions and the resulting local optima,
• Ease of implementation and adaptability,
• Computational ease in handling large search-spaces,
• Robustness to approximations/uncertainties in the property predictors.


In spite of their advantages in certain specific problem domains, all the methods mentioned above typically lack one or more of these features. An inverse solution strategy based on genetic algorithms [30, 31], which forms the focus of this chapter, is able to overcome most of these difficulties for several design applications.

5.2 GENETIC ALGORITHMS & GENETIC PROGRAMMING

5.2.1 Background

Genetic algorithms are a method for stochastic, evolutionary search. The underlying idea of the genetic algorithm is drawn from the Darwinian model of natural selection and evolution. Pioneering work on GAs was done by Holland [30]. Detailed discussions on GA fundamentals and applications can be found in Goldberg [31], Davis [32], Rawlins [33] and Man et al. [34]. The original idea presented by Holland is as follows: consider that every candidate solution to a given search problem can be represented in a 'genetic' form called the chromosome, with a one-to-one mapping between the solutions in the state space and their corresponding genetic forms. It is assumed that a solution to the forward problem already exists so that any point in the state space can be evaluated in terms of the objective of the search. Since the mapping between the state space and the chromosomes is one-to-one, the quality of any candidate solution is completely determined by its genetic information, i.e. its chromosome. Therefore the terms solution and chromosome are often used interchangeably in a genetic search.

The best feasible solution to a given search problem must have a corresponding chromosome under a given one-to-one, invertible mapping from the state space, Ω, to the genetic space, Ψ. Let this chromosome be called the target chromosome. Then the basic assumption in a genetic search is that a given gene pool (collection of chromosomes) can potentially lead to the target chromosome by the process of evolution, which is the creation of new chromosomes from existing ones via exchange of genetic information. The process of evolution is carried out in a manner similar to that in living systems based on natural selection and the law of 'the survival of the fittest'. Natural selection in living systems implies that stronger individuals are more likely to survive and win in a competing environment. In other words, fitter individuals are more likely to produce better offspring. In a GA, the fitter chromosomes are those that are closer to the target. Then the implementation of the algorithm is carried out such that the fitter chromosomes are rendered with better chances of passing their genetic information to the subsequent generations of chromosomes. The process of evolution is carried out for a pre-decided number of generations or till the target chromosome is obtained.


Figure 2: Framework for implementation of a genetic algorithm

5.2.2 Implementation

The framework for the implementation of a genetic algorithm or a genetic program is shown in Figure 2. The process starts with a collection of chromosomes. Each chromosome is assigned a fitness depending upon its proximity to the target. The fitter chromosomes are selected as 'parents' and they are allowed to exchange or alter their genetic information to create offspring. This is achieved by means of operators called genetic operators. A new population or generation of offspring is created to replace the existing population. This is the process of evolution, which is repeated for a pre-decided number of generations or till the target is located. There are five main aspects to the overall procedure, namely (1) genetic encoding, (2) assignment of fitness, (3) selection of parents for reproduction, (4) genetic operations and (5) replacement of existing chromosomes with newly evolved ones. These are discussed in detail below.

Genetic Encoding

Genetic encoding is the process of devising a one-to-one, invertible map, φ, that represents every point in the original state space, Ω, of the problem by a corresponding point in the genetic space, Ψ. The state space representation of a candidate solution is called the 'phenotype' and the genetic information, i.e. the corresponding chromosome, the 'genotype'. The terms genotype and phenotype have been derived from living systems where the phenotype is what is obtained when the genetic information is decoded or 'expressed'. Using φ⁻¹, the genetic information of a chromosome can be decoded to get the original point in the state space. Depending upon the problem, there could be more than one way of mapping the two spaces. However, in most cases, a convenient genetic representation of the state space variables arises naturally from the problem description itself. This is particularly true when some or all the variables are symbolic.

The chromosome consists of one or more 'genes'. Each gene is simply a sequence of one or more units on the chromosome. In a classical genetic algorithm, the units are binary, i.e. they have value 1 or 0. One may view a value of 0 as indicating a 'recessive' part of a gene and 1 an 'active' part. More generally, a gene could consist of units that are not restricted to binary values but can take on symbolic or numeric values. When the values are not restricted to binary, the overall procedure is called a genetic program instead of a genetic algorithm. However, the latter term is used loosely to refer to both evolutionary procedures. All possible combinations of the values of the units of a gene determine the different values, called 'alleles', possible for the gene as a whole. Different genes in a chromosome can have different numbers of units. Moreover, there can be units of more than one data-type (binary, numeric or symbolic) in a given chromosome and sometimes even in the same gene. Further, certain problems could require an encoding wherein different chromosomes in a population have different numbers of genes. The hierarchical structure of genetic encoding is shown in Fig. 3 for binary units.

Figure 3: Bit-string genetic encoding

It should be noted that the gene forms the fundamental unit of information in a chromosome and an individual unit has little meaning unless considered in concert with the other units making up the gene. This is similar to the DNA of living systems where the individual units are nitrogenous bases but it is the sequences of these bases that determine the genes and the genetic make-up of an individual. The process of genetic encoding establishes a relation between the individual variables in state space and the genes. Therefore the encoding is essentially deciding the number and types of genes used to make up the chromosomes, and the mapping between them and the state variables.
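As a minimal sketch of a one-to-one, invertible encoding, assume (purely for illustration, this is not a case from the chapter) that the state space is the set of integers 0 to 2^L - 1 and that the chromosome is a single gene of L binary units. The function names `encode` and `decode` are hypothetical.

```python
def encode(value, n_bits):
    """Map an integer phenotype to a bit-string genotype (most significant bit first)."""
    if not 0 <= value < 2 ** n_bits:
        raise ValueError("value is outside the state space covered by n_bits")
    return [(value >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def decode(chromosome):
    """Inverse map: recover the integer phenotype from the bit-string genotype."""
    value = 0
    for bit in chromosome:
        value = (value << 1) | bit
    return value


chromosome = encode(5, 4)       # genotype of the phenotype 5 with four units
phenotype = decode(chromosome)  # decoding returns the original point
```

Because every integer in the range has exactly one bit string and vice versa, decoding an encoded point always returns the original point, which is the property the text requires of the map.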

Fitness Function

The fitness of a chromosome is a positive value indicating its quality or degree of 'goodness'. Therefore it is obviously related to the objective of the search problem at hand. A given chromosome must have a unique fitness value, which implies that fitness must be a function of the genotype. Since there is a one-to-one correspondence between the genotype and phenotype, the function can also be expressed as a function of the phenotype. This is called the fitness function. This is usually very closely related (if not identical) to the original objective function of the search problem defined over the state space. In several cases, particularly those in which the optimal objective function value is known beforehand, it is convenient to devise a fitness function whose range is the interval [0,1]. The fitness value of the target chromosome is 1 and the extent of departure from 1 is a measure of the 'distance' from the target for other chromosomes.

Selection of Parents

It is important to devise a selection mechanism that will simulate the phenomenon of natural selection in evolution. Thus, the fitter the chromosome, the greater should be its chances of being selected as a parent. There are several schemes for selection of parents, some of which are given below:

Random selection: Parent chromosomes are randomly picked for reproduction from the current population. Such a scheme does not incorporate the idea of natural selection. Random selection is rarely used.

Roulette Wheel selection: This is the most commonly used selection policy. Here the probability of selection of a chromosome is directly proportional to its fitness (hence the selection is also called fitness-proportionate selection). It is given by

$$P(i) = \frac{F(i)}{\sum_{j=1}^{N} F(j)} \qquad (1)$$

where P(i) is the probability of selection of chromosome i, F(i) is the fitness of chromosome i and N is the total number of individuals in the population. Consider a very simple example where the population size, N is 5. Let the chromosomes have fitness values 0.8, 0.5, 0.3, 0.25, 0.15. Then the probabilities of selection of the different chromosomes are given by the areas of the 'roulette wheel' as shown in Fig. 4.
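The five-chromosome example can be checked with a short sketch. The fitness values come from the text; the function names and the use of Python's `random.Random` are assumptions made for illustration. The fitness values sum to 2.0, so the selection probabilities work out to 0.40, 0.25, 0.15, 0.125 and 0.075.

```python
import random


def selection_probabilities(fitness):
    """Fitness-proportionate selection probabilities, as in equation (1)."""
    total = sum(fitness)
    return [f / total for f in fitness]


def roulette_select(fitness, rng):
    """Spin the wheel once and return the index of the selected chromosome."""
    spin = rng.random() * sum(fitness)
    running = 0.0
    for index, f in enumerate(fitness):
        running += f
        if spin < running:
            return index
    return len(fitness) - 1  # guard against floating-point round-off


fitness = [0.8, 0.5, 0.3, 0.25, 0.15]
probabilities = selection_probabilities(fitness)
rng = random.Random(0)
parents = [roulette_select(fitness, rng) for _ in range(10)]
```

Each call to `roulette_select` corresponds to one spin of the wheel; calling it repeatedly yields the parents for one generation.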


Figure 4: Probabilities of selection in Roulette-Wheel selection

Commonly, the probability is determined by using cumulative fitness values instead of the actual or raw fitness. The raw fitness values are first scaled with that of the highest in the population to get the scaled fitness

$$sf(i) = \frac{F(i)}{F_{max}} \qquad (2)$$

Then the cumulative fitness of chromosome i is given by

$$cumf(i) = \sum_{j=1}^{i} sf(j) \qquad (3)$$

where sf(j) is the scaled fitness of chromosome j. Now the probability of selection of a chromosome is given by

$$P(i) = \frac{\sum_{j=1}^{i} sf(j)}{\sum_{j=1}^{N} sf(j)} \qquad (4)$$

Using equation (4) instead of (1) favors the highly fit individuals even more during selection. The effectiveness of the roulette wheel policy strongly depends on the actual definition of the fitness function.

Rank Selection: The above drawback of roulette wheel selection is avoided by using rank selection, where individuals are selected only on the basis of their rank in the population. However, roulette wheel selection is usually preferred over rank selection since the former is better able to simulate the law of natural selection.

Tournament Selection: This type of selection is often used in optimizing game-playing strategies where two players adopting different strategies are made to play against each other, upon which the strategy of the winner is chosen. Similarly, here two chromosomes are picked and the one with the higher fitness is chosen as a parent for reproduction.
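A minimal sketch of tournament selection follows, assuming the two contenders are drawn uniformly at random and without replacement (the text does not specify how they are picked). The function name and the `size` parameter are hypothetical; `size=2` reproduces the two-chromosome case described above.

```python
import random


def tournament_select(fitness, rng, size=2):
    """Return the index of the fittest among `size` distinct random contenders."""
    contenders = rng.sample(range(len(fitness)), size)
    return max(contenders, key=lambda index: fitness[index])


fitness = [0.8, 0.5, 0.3, 0.25, 0.15]
rng = random.Random(0)
parent = tournament_select(fitness, rng)
```

Under this assumption the least fit chromosome can never be selected, since it loses every tournament it enters; only the ordering of the fitness values matters, not their magnitudes.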

Genetic Operators: The process of creation of offspring chromosomes from parents is achieved by means of genetic operators. A genetic operator modifies the genetic information of one or more parent chromosomes according to some probability; otherwise it leaves the parent chromosome(s) unchanged. This probability is called the operation rate of the genetic operator. Two genetic operators are primarily used in a classical GA, namely crossover and mutation.

Crossover: Crossover involves the creation of two offspring chromosomes by exchange of contiguous chunks of units from two parent chromosomes. In a single-point crossover, one point is chosen as the crossover point. Each parent chromosome is cut at the crossover point and the parts are exchanged. This is shown in Figure 5a.

Some studies have shown two-point crossover to be superior to single-point in certain cases [35]. Recently other types of crossover such as multiple-point, i.e. n-point crossover [36] with n > 2, and uniform crossover [37, 38] have been proposed.

Mutation: Mutation operates on one chromosome at a time. It involves the modification of one or more units of the chromosome. A binary unit when mutated becomes 1 if it was originally 0 or becomes 0 if originally 1. In general, a unit of a chromosome, upon mutation, changes from its current value to some other allowable value. In single-bit mutation (the term bit originates from the case of binary representations), a unit is randomly picked on the chromosome and mutated. In a string-wise mutation, every bit on the chromosome is mutated with some probability. This is shown in Figure 5b where four of the total ten bits of the chromosome have been mutated.
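The two classical operators can be sketched for binary chromosomes as follows. This is an illustrative sketch under simple assumptions: chromosomes are Python lists of 0/1 of equal length, the crossover point is drawn uniformly from the interior positions, and the function names and `rate` parameters (the operation rates) are hypothetical.

```python
import random


def single_point_crossover(parent_a, parent_b, rng, rate=1.0):
    """Cut both parents at one random point and exchange the tails."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if len(parent_a) < 2 or rng.random() >= rate:
        return parent_a[:], parent_b[:]  # operator not applied: copy the parents
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def string_wise_mutation(chromosome, rng, rate):
    """Flip every bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


rng = random.Random(0)
zeros, ones = [0] * 10, [1] * 10
child_a, child_b = single_point_crossover(zeros, ones, rng)
mutant = string_wise_mutation(zeros, rng, rate=0.1)
```

With an all-zeros and an all-ones parent, the crossover point is visible directly in the children as the position where the bit value switches.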

Figure 5a: Single-point crossover


Figure 5b: Length-wise mutation

The crossover operator results in drastic changes in the genotype, which results in huge leaps from one point to another in the state space. Thus by means of crossover the algorithm is able to rapidly navigate through several regions of the search space spread far apart from one another. Mutation, on the other hand, results in small changes in a chromosome. This translates to small movements in a given region of the search space. Thus mutation is a local optimization operator. This combined exploratory and local-search ability of the GA is its most significant feature. The algorithm is able to quickly recognize the promising areas of the search space and then closely investigate each of them by local search.

In addition to crossover and mutation, one could devise other operators for creation of offspring. Examples of such operators will be discussed in the polymer design case study presented in later sections. In the case of constrained optimization problems, it is often desirable to eliminate or at least discourage infeasible solutions in the population. The formation of infeasible solutions can be prevented upfront via suitable modifications to the operators. Such modified operators that only produce feasible solutions are called constrained genetic operators. An alternative way of tackling infeasibility is by providing a penalty on infeasible solutions so that their fitness values become very low. Then such poorly fit chromosomes will most likely get eliminated during the course of evolution as a result of natural selection. The argument in favor of allowing infeasible chromosomes at all is that they could contain some good genes despite being infeasible on the whole. Hence, by means of contributing the good genes, such chromosomes could eventually lead to fitter, feasible offspring during evolution. Either policy may be adopted to tackle the problem of infeasibility.

Replacement Policy

By repeated selection of parents followed by application of the genetic operators, a number of offspring can be created. Then, one of several strategies may be adopted to replace the current generation with the offspring generation. One such scheme is called generational replacement, where all the chromosomes in the existing population are replaced by the offspring [39]. In such a case, to maintain a constant population size of N, each generation will involve N offspring chromosomes.

A drawback of the above policy is that all the chromosomes of the current population, including the best, are discarded. Then, if some of the fitter chromosomes fail to produce offspring in the succeeding generation, their good genes could be lost permanently. To avoid such an occurrence, the generational replacement policy is usually combined with a policy called elitism. In an elitist strategy, one or a few of the best chromosomes are directly passed on to the next generation. This conserves a fixed number of the best solutions produced up to that point of evolution. Though elitism can lead to a faster domination of a certain chromosome in the population, in general it has been found to improve the overall performance of the genetic procedure.

Some schemes replace only the worst chromosomes when new chromosomes are inserted into the population. Such schemes generate a small number of offspring. In other words, these are heavily elitist policies. Another policy is to replace parents by the offspring produced by them. A problem with this is that highly fit parents need not always lead to good offspring chromosomes. Thus, used standalone, the policy can lead to the loss of good chromosomes. Another method involves replacement of the eldest chromosomes, which are the ones that have been in the population for more than a certain number of generations. Once again, the best chromosomes could eventually get discarded during evolution if such a policy were implemented. The most commonly used scheme is generational replacement with elitism.
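The five aspects of the procedure can be tied together in a compact sketch of generational replacement with elitism. Everything specific here is an assumption chosen for illustration: binary chromosomes of fixed length, roulette wheel selection via `random.Random.choices`, single-point crossover, string-wise mutation, the parameter defaults, and the toy fitness function (a shifted count of 1-bits, kept strictly positive). It is not the framework used in the case studies.

```python
import random


def evolve(fitness_fn, n_bits, pop_size, generations, rng,
           crossover_rate=0.8, mutation_rate=0.02, n_elite=1):
    """Generational replacement with elitism on fixed-length bit strings."""
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        scores = [fitness_fn(chromosome) for chromosome in population]
        best_history.append(max(scores))
        # Elitism: copy the n_elite fittest chromosomes unchanged.
        ranked = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
        offspring = [population[i][:] for i in ranked[:n_elite]]
        # Fill the remainder of the new generation with children.
        while len(offspring) < pop_size:
            mother, father = rng.choices(population, weights=scores, k=2)
            child_a, child_b = mother[:], father[:]
            if rng.random() < crossover_rate:
                point = rng.randrange(1, n_bits)
                child_a = mother[:point] + father[point:]
                child_b = father[:point] + mother[point:]
            for child in (child_a, child_b):
                mutated = [1 - bit if rng.random() < mutation_rate else bit
                           for bit in child]
                if len(offspring) < pop_size:
                    offspring.append(mutated)
        population = offspring
    scores = [fitness_fn(chromosome) for chromosome in population]
    best_history.append(max(scores))
    return max(population, key=fitness_fn), best_history


def toy_fitness(chromosome):
    """Hypothetical fitness in (0, 1]: shifted fraction of 1-bits."""
    return (sum(chromosome) + 1) / (len(chromosome) + 1)


best, history = evolve(toy_fitness, n_bits=20, pop_size=30,
                       generations=40, rng=random.Random(0))
```

Because the elite chromosomes are copied without modification, the best fitness recorded in `history` can never decrease from one generation to the next, which is the conservation property attributed to elitism above.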

5.3 THE ALGEBRA OF GENETIC ALGORITHMS

There are broadly two schools of thought as to why genetic algorithms really work: Schema Theory and Building Block Hypothesis. Recently, a generalization of the schema theory called Forma Analysis has been proposed [40]. The important features of each theory are presented.

5.3.1 Schema Theory

The standard theory is based on Holland's schema analysis presented in his pioneering work on GAs [30]. The schema theory or schema analysis operates in the genotype space. It is applicable when the chromosomes are linear strings of a fixed length and when the units take on a well-defined set of values. Schema analysis provides valuable insight into the operation of a genetic algorithm. There have been extensions to the analysis that enable tracing the evolution of individual strings for infinite and finite populations [41, 42, 43].


Figure 6: Three-dimensional cube as the genotype space

We present here a discussion on the schema theory from Man et al. (1999). In order to understand the meaning of a 'schema', let us consider an example where the genetic representation involves chromosomes consisting of three bits. Thus, the genetic space is the three-dimensional cube shown in Fig. 6.


The origin corresponds to the chromosome 000. In any genetic representation, each vertex of the genetic search space corresponds to a chromosome. Here, the total number of possible chromosomes is eight. The bit-strings of adjacent vertices differ by exactly one bit; in other words they are separated by a Hamming distance of 1. The shaded face of the cube consists of vertices represented by the string 0**, where '*' is used to represent '0 or 1', i.e. a 'wild card' symbol. Binary strings containing one or more '*' are called 'schemata' (singular: 'schema'). The number of fixed bit values that appear in a schema is called the order, o, of the schema. For instance, in Fig. 6, the schema 0*0 corresponds to the left edge (shown by a thick line) of the shaded face of the cube. The order of this plane is 2. It matches the chromosomes 000 and 010, which make up the edge. The shaded front face of the cube is an order 1 plane and is represented by the schema 0**. Here, schema 0** corresponds to the four chromosomes that make up the vertices of the face. In general, every schema represents exactly 2^r chromosomes, where r is the number of 'wild card' symbols, *, in the schema template. In binary encoding when the length of each chromosome is L, every chromosome is a corner of the L-dimensional hypercube and belongs to 2^L - 1 different planes. Apart from the order of a schema, another important attribute is the distance between its outermost fixed positions, which is called its Defining Length, δ. For instance, the Defining Length of the schema *0101* is 3 whereas that of *010*1 is 4. The defining length is a measure of the compactness of the information contained in a schema.
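The schema attributes defined above are easy to compute. The following sketch represents a schema as a string over '0', '1' and '*'; the function names are hypothetical. It reproduces the examples from the text: schema 0*0 has order 2 and matches 000 and 010, schema 0** matches four chromosomes, and the defining lengths of *0101* and *010*1 are 3 and 4.

```python
from itertools import product


def order(schema):
    """Number of fixed (non-wild-card) positions in the schema."""
    return sum(1 for symbol in schema if symbol != "*")


def defining_length(schema):
    """Distance between the outermost fixed positions of the schema."""
    fixed = [i for i, symbol in enumerate(schema) if symbol != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


def matches(schema, chromosome):
    """True if the chromosome agrees with the schema at every fixed position."""
    return len(schema) == len(chromosome) and all(
        s == "*" or s == c for s, c in zip(schema, chromosome))


def members(schema):
    """All chromosomes represented by the schema (2**r of them for r wild cards)."""
    choices = ["01" if symbol == "*" else symbol for symbol in schema]
    return ["".join(bits) for bits in product(*choices)]


edge = members("0*0")   # the two chromosomes on the thick edge
face = members("0**")   # the four vertices of the shaded face
```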

The way a genetic algorithm utilizes schemata is as follows. The genetic search samples several chromosomes in each generation. Each such population of sample solutions provides information about numerous planes. In particular, planes of low order would likely be sampled by several solutions in the population. Every chromosome being sampled effectively results in 2^L - 1 planes being sampled. This is called the 'implicit parallelism' in genetic algorithms. The competition for survival between different chromosomes can be viewed at a higher level as the competition between the corresponding planes or schemata. Thus implicit parallelism means that many such schema competitions are being simultaneously evaluated and solved in parallel. Holland derived the Implicit Parallelism Lower Bound that states that the number of schemata processed in a single generation is O(N^3), where N is the size of the population. Fitzpatrick et al. [44] argued that for L > 64 and 2^6 < N < 2^20 the number of schemata processed was greater than N^3.

The membership of a schema at a given stage of evolution is defined as the number of chromosomes in the current population belonging to that schema. As will be shown later, during the course of evolution, a fitter schema has greater chances of survival and correspondingly its membership grows, i.e. its representation in the population increases. By the same token, as a result of natural selection, the representation of all poorly fit schemata would decrease in the highly competitive environment. The schema theory suggests that such increase or decrease in the representation of competing schemata in the population is the outcome of genetic operations acting according to the relative fitness of the chromosomes belonging to the schemata. The Schema Theorem [30] gives a lower bound for the sampling rate of a given schema, that is, the rate of change of membership of the schema during evolution. It is derived as follows:

Since a schema is a collection of strings, we can associate an average fitness value with every schema at time (generation) t. Let $\hat{\mu}_\xi(t)$ be the average observed fitness of a given schema ξ at time t, i.e. the average fitness of all the members of the population at time t that are members of schema ξ.

Let $N_\xi(t)$ be the membership of schema ξ at time t. If fitness proportionate selection is adopted during reproduction, we can estimate the number of members of schema ξ in the next generation. If $\bar{\mu}(t)$ is the average fitness of the entire population at time t then the probability of selection for reproduction of a member of schema ξ (in a single string selection) is equal to $\hat{\mu}_\xi(t) / (N \bar{\mu}(t))$. With N selections per generation and $N_\xi(t)$ members of the schema, the expected number of members in schema ξ in the next generation is

$$E(N_\xi(t+1)) = N_\xi(t)\,\frac{\hat{\mu}_\xi(t)}{\bar{\mu}(t)} \qquad (5)$$

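Let

$$\varepsilon = \frac{\hat{\mu}_\xi(t) - \bar{\mu}(t)}{\bar{\mu}(t)} \qquad (6)$$

A value of ε > 0 implies that the schema has an above average fitness and vice versa. Substituting equation (6) into (5) gives $E(N_\xi(t+1)) = N_\xi(t)(1+\varepsilon)$, so if ε remains constant over the generations, an 'above average' schema receives an exponentially increasing number of members in the subsequent generations:

$$E(N_\xi(t)) = N_\xi(0)\,(1+\varepsilon)^{t} \qquad (7)$$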

The above equation shows that the growth of an above-average schema is highly favored as a consequence of the fitness proportionate selection policy. However, the above equation does not accurately reflect the sampling rate. The disruptive action of the evolutionary operators tends to decrease the membership of such schemata and needs to be incorporated in the sampling rate. Consider single-point crossover being applied over chromosomes of length L. The crossover point would, in general, be selected uniformly among the L-1 possible positions along the chromosome. Then the probability of destruction of a schema $\xi$ as a result of the crossover is

$$P_d(\xi) = \frac{\delta(\xi)}{L-1} \tag{8}$$

where $\delta(\xi)$ is the defining length of schema $\xi$. The probability that the schema would survive the crossover is given by

$$P_s(\xi) = 1 - \frac{\delta(\xi)}{L-1} \tag{9}$$

If the operation rate of crossover is $P_c$, then the probability of survival of schema $\xi$ is

$$P_s(\xi) = 1 - P_c\,\frac{\delta(\xi)}{L-1} \tag{10}$$

It should be noted that even if the crossover point occurs between fixed positions, schema $\xi$ might still survive the operation. Therefore equation (10) has to be modified as

$$P_s(\xi) \geq 1 - P_c\,\frac{\delta(\xi)}{L-1} \tag{11}$$

The effect of mutation can be similarly incorporated. Suppose that the probability of bit mutation is $P_m$. Then the probability of a single bit surviving is $1 - P_m$. Therefore the probability of survival of schema $\xi$ after a sequence of one-bit mutations is

$$P_s(\xi) = (1 - P_m)^{o(\xi)} \tag{12}$$

where $o(\xi)$ is the order of schema $\xi$. Since $P_m \ll 1$, the above equation can be approximated as

$$P_s(\xi) \approx 1 - P_m\,o(\xi) \tag{13}$$

Incorporating the disruptive effects of crossover and mutation into equation (5), an equation for the reproductive growth of a schema is obtained as

$$E(N_\xi(t+1)) \geq N_\xi(t)\,\frac{\hat{\mu}_\xi(t)}{\bar{\mu}(t)}\left[1 - P_c\,\frac{\delta(\xi)}{L-1} - P_m\,o(\xi)\right] \tag{14}$$

In general, in addition to crossover and mutation, several other operators may be applied. If $\Omega$ is the set of all genetic operators being used, then the above equation can be stated as

$$E(N_\xi(t+1)) \geq N_\xi(t)\,\frac{\hat{\mu}_\xi(t)}{\bar{\mu}(t)}\left[1 - \sum_{\omega \in \Omega} p_\omega\,p_\omega^{\xi}\right] \tag{15}$$

where the term $p_\omega\,p_\omega^{\xi}$ quantifies the potential disruptive effect of the application of a genetic operator $\omega \in \Omega$.
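The bound for crossover and mutation can be evaluated numerically. The following is a minimal sketch, assuming binary schemata written with '*' as the wildcard symbol and assuming the bound has the form $N_\xi(t)\,(\hat{\mu}_\xi/\bar{\mu})\,[1 - P_c\,\delta(\xi)/(L-1) - P_m\,o(\xi)]$; the function names and the parameter values are illustrative and are not taken from the chapter.

```python
def schema_order(schema: str) -> int:
    """o(xi): number of fixed (non-'*') positions in the schema."""
    return sum(1 for symbol in schema if symbol != "*")


def defining_length(schema: str) -> int:
    """delta(xi): distance between the outermost fixed positions."""
    fixed = [i for i, symbol in enumerate(schema) if symbol != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


def growth_lower_bound(n_members, schema_fitness, mean_fitness, schema, p_c, p_m):
    """Lower bound on the expected membership in the next generation."""
    length = len(schema)
    disruption = (p_c * defining_length(schema) / (length - 1)
                  + p_m * schema_order(schema))
    return n_members * (schema_fitness / mean_fitness) * (1.0 - disruption)


# Two schemata with the same order and fitness but different defining lengths.
compact = growth_lower_bound(10, 1.2, 1.0, "1*0*****", p_c=0.6, p_m=0.01)
spread = growth_lower_bound(10, 1.2, 1.0, "1******0", p_c=0.6, p_m=0.01)
print(f"compact: {compact:.2f}, spread: {spread:.2f}")
```

With these illustrative numbers the compact schema keeps a bound of roughly 9.7 expected members while the spread-out schema drops to roughly 4.6, which is the sense in which short, low-order schemata are favored.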

The generalized form of the schema growth equation derived above is the mathematical statement of the Schema Theorem, or the Fundamental Theorem of Genetic Algorithms. The implication of the theorem is that short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations. Bridges and Goldberg [45] extended the schema theorem for binary schemata, replacing the inequality with an equality by including terms for string gains as well as disruption terms.

It is important to note that the schema theorem applies equally for a given phenotype space and the corresponding genotype space, W, regardless of the mapping between the two spaces. Assuming that the two spaces have the same size, there are as many as |W|! such mappings possible, yet the schema theorem applies equally to each of them. This makes the theorem powerful and widely applicable. The schema theory does, however, have some limitations, the most obvious of which is that it is applicable only to binary representations having one-to-one correspondence between the chromosomes and the solutions in the phenotype space. In several problems it is convenient and natural to use non-binary units; under real-valued or symbolic encoding, for instance, the theory cannot explain the mechanics of the genetic program. The schema theory also assumes the use of standard genetic operators such as crossover or mutation, whereas a number of problems require the use of special problem-specific operators, which may often be constrained operators. All the above instances lack the theoretical backing of the schema theory. As a result, until recently, they were mostly considered to be heuristic approaches. Forma Theory is a recent generalization of the schema theory, which can provide theoretical support for such approaches to the same extent as schema theory does for classical GAs. A detailed discussion of Forma Theory is beyond the scope of this chapter, but the key features of the theory are briefly presented in the following section.

5.3.2 Forma Theory

The Forma Theory was developed by Radcliffe in a series of papers [40, 46-49]. The theory does not require any specific representation and therefore applies equally to non-binary encodings. The representation is considered purely as a matter of implementation and does not affect the analysis. Thus the theory is a generalization of the schema theory of classical GAs and is more flexible.

The theory defines 'formae' as sets of solutions sharing a certain property assumed to be relevant to the solutions' fitness. Formae are simply extensions of the idea of the schema, where the latter refers to a set of solutions sharing specific binary units. The theory presents some guidelines as to the properties required of operators with respect to such formae that enable a genetic program to actually work. In his analysis, Radcliffe suggested some standard operators for a given set of formae. The specifics of a given problem are incorporated by defining the appropriate set of formae. Then the effect of standard operators is analyzed on the abstract search space. The theory is also able to examine the effect of non-standard, 'heuristic' operators.

5.3.3 Building Block Hypothesis

Holland introduced the idea that for a GA to work efficiently, the string-based representation should be able to effectively reflect the structure of the search space. Ideally, certain bits or groups of bits (genes) in a chromosome should represent certain properties of the corresponding phenotype that have significant bearing on the fitness. The assumption, then, is that the combination of such 'good' genes would lead to highly fit solutions. Chromosomes having one or more good genes are simply short, low-order schemata whose fixed-value bits make a significant contribution towards high overall fitness. Such high-performance schemata are called 'building blocks'. The building block hypothesis suggests that a genetic algorithm seeks near-optimal performance through the juxtaposition of such blocks. The agents responsible for the juxtaposition of building blocks are the genetic operators such as crossover and mutation. These operators have the ability to generate, promote and juxtapose building blocks to form the optimal or nearly optimal strings. Crossover tends to conserve the genetic information present in the parent chromosomes. Therefore, when the chromosomes chosen for crossover are similar, their capacity to generate new building blocks diminishes. On the other hand, mutation does not conserve genetic information and can generate radically new building blocks. The building block hypothesis suggests that the encoding can critically determine the performance of the GA, since the coding should be such that short, high-performance building blocks are not only possible but also easy enough for the algorithm to locate quickly.

It should be noted that the above theories only offer possible explanations as to why GAs work. In general, because of the heuristic nature of the search, no guarantees can be offered about convergence. However, this very aspect of the algorithm enables the search to overcome problems presented by local minima traps or discontinuous spaces. Thus the heuristic nature of the GA is in a way both its strength and its weakness.

5.4 GA-BASED CAMD: THE POLYMER DESIGN PROBLEM

The adaptation [50] and application of GAs [51-54] as a solution framework for CAMD are described in this section. They are illustrated via the polymer design problem: a common design problem in polymer engineering, which is the determination of a polymer structure that meets a number of physical property constraints. Stated more specifically, the polymer design problem is to determine the repeat unit structure of a polymer, say -[-x1-x2-...-xL-]n-, satisfying a set of desired macroscopic physical properties, where the xi are functional groups.

Figure 7: GA framework for the polymer design problem

5.4.1 Proposed GA Framework

The proposed framework for the polymer design problem uses (i) the standard group contribution methods discussed earlier for the forward problem and (ii) an adaptation of the standard genetic algorithm for the inverse problem. Figure 7 shows the GA framework for polymer design. The standard GA is modified in three aspects: the representation of molecules (polymer repeat units), the creation of new operators in order to exploit chemical knowledge of molecular interactions and rearrangements, and the fitness function in order to handle property constraints. The selection policy is the commonly used fitness-proportionate selection. Elitism, at 10% of the population size, is incorporated into the replacement policy. A detailed discussion is presented in the following sections.
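The selection and replacement policy described above can be sketched in a few lines. This is a minimal, generic sketch under stated assumptions: the function name `next_generation`, the `fitness_fn` and `vary_fn` callbacks, and the toy population of numbers are all hypothetical, fitness values are assumed to be non-negative and not all zero, and nothing here reproduces the authors' implementation.

```python
import random


def next_generation(population, fitness_fn, vary_fn, rng, elite_fraction=0.10):
    """One generation: elitist carry-over plus fitness-proportionate selection."""
    fitnesses = [fitness_fn(individual) for individual in population]
    # Rank by fitness and copy the top fraction unchanged (elitism).
    ranked = sorted(zip(fitnesses, range(len(population))), reverse=True)
    n_elite = max(1, round(elite_fraction * len(population)))
    new_population = [population[index] for _, index in ranked[:n_elite]]
    # Fill the remaining slots with offspring of roulette-wheel-selected parents.
    while len(new_population) < len(population):
        parent = rng.choices(population, weights=fitnesses, k=1)[0]
        new_population.append(vary_fn(parent, rng))
    return new_population


# Toy usage: individuals are numbers in [0, 1] and the fitness is the number itself.
rng = random.Random(0)
population = [rng.random() for _ in range(100)]
best_before = max(population)
population = next_generation(
    population,
    fitness_fn=lambda x: x,
    vary_fn=lambda x, r: min(1.0, max(0.0, x + r.uniform(-0.05, 0.05))),
    rng=rng,
)
```

Because the elite members are copied unchanged, the best fitness in the population cannot decrease from one generation to the next, and the population size stays fixed.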

5.4.2 Molecule Representation

A standard GA employs the bit-string encoding scheme as discussed earlier. However, for the polymer design problem, if bit strings were used to represent molecular structures then one would need binary matrices to represent the groups present in the structure and their connectivity. Such a representation would not only make the overall scheme more complicated as a result of extensive bookkeeping, but also render the representation difficult to follow and interpret. A more suitable and natural representation would be to represent chemical structure as a string of symbols or functional groups. Under such an encoding, the string is composed of one or more genes, each of which represents an elemental, sub-structural or monomer unit. The units are functional groups on the main backbone chain and the side-chains.

[Figure 8 panels: Example Groups (elemental, sub-structural and monomer units); Symbolic Polymer Representation, e.g. ((C BZ C) ((H H) NIL (H H)))]

Figure 8: Molecular structure representation

Since the encoding is symbolic, the method is not a classical genetic algorithm but a genetic program. It is important to bear in mind that the problem involves a search over polymer repeat units that may be of different lengths. Consequently, the encoding does not require chromosomes to have a fixed number of genes. It will be seen later that the operators can in fact modify the length of a parent chromosome to result in offspring of different length.

Figure 8 presents examples of the symbolic coding scheme representing molecular structure as nested lists in Lisp [55]. For the example ((C C) ((H H) (H Cl))), the first list of two Cs stands for two carbon backbone units. The subsequent lists contain the side-chain substituents for each backbone unit, in the order of the backbone list. It is necessary to emphasize once again that the adopted genetic encoding based on functional groups is a natural representation of the problem, which enables easy expression of the rich and complex chemistry of molecules. Further, it facilitates the integration of any heuristic chemical knowledge that one might have about the problem into the genetic framework so as to speed up the search process. For instance, instead of starting the initial GA population at random, a designer using the GA system can start with structures that he or she believes to be good guesses based on experience.
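The nested-list encoding carries over directly to other languages. The following minimal sketch mirrors the Lisp lists with Python tuples; the helper names `paired_units` and `group_counts`, and the use of `None` to stand in for Lisp's NIL, are assumptions made for illustration only.

```python
from collections import Counter

# ((C C) ((H H) (H Cl))): a backbone list followed by one side-chain list per unit.
repeat_unit = (("C", "C"), (("H", "H"), ("H", "Cl")))

# ((C BZ C) ((H H) NIL (H H))): None plays the role of NIL (no side groups).
ring_unit = (("C", "BZ", "C"), (("H", "H"), None, ("H", "H")))


def paired_units(unit):
    """Pair each backbone group with its tuple of side-chain substituents."""
    backbone, side_chains = unit
    if len(backbone) != len(side_chains):
        raise ValueError("one side-chain list is required per backbone unit")
    return [(group, tuple(subs or ())) for group, subs in zip(backbone, side_chains)]


def group_counts(unit):
    """Count every group in the repeat unit, e.g. as input to a group contribution method."""
    backbone, side_chains = unit
    counts = Counter(backbone)
    for subs in side_chains:
        counts.update(subs or ())
    return counts


print(paired_units(repeat_unit))
print(group_counts(ring_unit))
```

Because the backbone is an ordinary sequence, chromosomes of different lengths need no special handling, which is the property the variable-length operators rely on.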

5.4.3 Fitness Function

For the polymer-design problem, two kinds of fitness functions are used depending on the nature of the property constraints. When one is designing for a target property value with some bounds (i.e. both upper and lower bounds on the desired value), the following Gaussian-like function is employed:

$$F = \exp\left[-\alpha \sum_i \left(\frac{P_i - \bar{P}_i}{P_{i,\max} - P_{i,\min}}\right)^2\right]$$

where $P_i$ is the i-th property value, $P_{i,\max}$ and $P_{i,\min}$ are respectively the maximum and minimum acceptable property values, which are used to normalize the property values, and $\bar{P}_i$ is the average of the maximum and minimum acceptable property values. The index i ranges over all the property constraints that are applied.

For example, consider designing for a glass transition temperature of 400 K ($\bar{P}_i$ = 400 K), with $P_{i,\max}$ = 402 K and $P_{i,\min}$ = 398 K. Then, if for a particular molecular candidate $P_i$ is 420 K, the candidate is somewhat far from the desired value, as indicated by its fitness of 0.29 (for $\alpha$ = 0.001). The function F ranges from 0 to 1, with 1 being the target molecule's fitness. The parameter $\alpha$ is the fitness decay rate that determines how the fitness values fall off as the solutions move away from the center of the target. The Gaussian fitness function is shown in Fig. 9.
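The Gaussian-like fitness can be evaluated with a short function. This sketch assumes the form $F = \exp[-\alpha \sum_i ((P_i - \bar{P}_i)/(P_{i,\max} - P_{i,\min}))^2]$, which is a reading of the definitions given above; the exact normalization used by the authors may differ, and the function name `gaussian_fitness` is hypothetical.

```python
import math


def gaussian_fitness(values, bounds, alpha):
    """Gaussian-like fitness under the assumed form stated in the lead-in.

    values: property values P_i of the candidate
    bounds: (P_i_min, P_i_max) pairs, one per property
    alpha:  fitness decay rate
    """
    total = 0.0
    for p, (p_min, p_max) in zip(values, bounds):
        target = 0.5 * (p_min + p_max)          # centre of the acceptable window
        total += ((p - target) / (p_max - p_min)) ** 2
    return math.exp(-alpha * total)


# Glass transition temperature example: window 398-402 K, candidate at 420 K.
window = [(398.0, 402.0)]
print(round(gaussian_fitness([420.0], window, alpha=0.05), 2))
print(round(gaussian_fitness([420.0], window, alpha=0.001), 3))
print(gaussian_fitness([400.0], window, alpha=0.001))
```

Under this assumed form, the 420 K candidate scores about 0.29 when the decay rate is 0.05 and about 0.975 when it is 0.001, so the fitness values obtained depend strongly on the decay rate chosen. A candidate at the centre of the window always scores 1.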

The second type of fitness function used is a sigmoidal function. This is preferred when the design involves property constraints that have only a lower bound or an upper bound, but not both:

$$F_i = \frac{1}{1 + \exp\left[-\beta\,\dfrac{P_i - P_{F=0.5,i}}{P_{\mathrm{Range},i}}\right]}$$

where $P_{F=0.5,i}$ is the property value for which the evaluated fitness is 0.5. It is taken to be the lower or the upper limit of the acceptable property constraints. $P_{\mathrm{Range},i}$ normalizes the property values so as to remove any bias of a single property on the overall fitness. The total fitness is taken as the mean of all individual property fitness values. The parameter $\beta$ controls the slope of the sigmoid. Figure 10 displays the fitness function for $\beta$ = 10.
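A sigmoidal fitness of this kind can be sketched as follows. The form $F_i = 1/(1 + \exp[-\beta (P_i - P_{F=0.5,i})/P_{\mathrm{Range},i}])$ is assumed here; flipping the sign of the exponent for an upper-bound constraint is an additional assumption of this sketch, and the names `sigmoid_fitness` and `total_fitness` are hypothetical.

```python
import math


def sigmoid_fitness(p, p_half, p_range, beta=10.0, lower_bound=True):
    """Sigmoidal fitness for a one-sided constraint.

    p_half is the property value at which the fitness equals 0.5;
    p_range normalizes the property scale; beta sets the slope.
    """
    z = beta * (p - p_half) / p_range
    if not lower_bound:
        z = -z                      # assumed mirror image for an upper bound
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                 # numerically stable branch for large negative z
    return e / (1.0 + e)


def total_fitness(individual_fitnesses):
    """Overall fitness as the mean of the individual property fitness values."""
    return sum(individual_fitnesses) / len(individual_fitnesses)


# Lower bound on Tg at 400 K and upper bound on permeability at 1.0 (illustrative ranges).
f_tg = sigmoid_fitness(450.0, p_half=400.0, p_range=100.0)
f_perm = sigmoid_fitness(0.5, p_half=1.0, p_range=1.0, lower_bound=False)
print(total_fitness([f_tg, f_perm]))
```

The normalizing ranges of 100 K and 1.0 in the usage lines are illustrative values, not taken from the case studies.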

Figure 9: Gaussian fitness function

Figure 10: Sigmoidal fitness function

5.4.4 Adaptation of Genetic Operators

The molecular string representation offers an excellent platform to fully exploit the richness and variety of the chemistry of molecular evolution. Towards this end, new genetic operators (in addition to the crossover and mutation operators) not previously found in the standard genetic algorithm literature have been developed [52]:

Single-point Crossover

Figure 11 shows the single-point crossover operator. In this example, crossover occurs after position three of parent #1 and position two of parent #2 (as shown by the dotted lines). The offspring are created by crossing over the genes of the parents as shown. When the parents are chromosomes of different lengths, as in the case of Fig. 11, the cut-off point is chosen by counting the genes from the left or the right in each parent. Consequently, the crossover operator can lead to offspring with chromosomes of lengths different from either parent.
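Single-point crossover on variable-length chromosomes can be sketched as below. Chromosomes are simplified here to flat lists of backbone groups, parents are assumed to have at least two genes, and no chemical feasibility check is made; the function names and the example parents are hypothetical.

```python
import random


def crossover_at(parent1, parent2, cut1, cut2):
    """Exchange the tails of two parents after cut1 and cut2 genes respectively."""
    child1 = parent1[:cut1] + parent2[cut2:]
    child2 = parent2[:cut2] + parent1[cut1:]
    return child1, child2


def single_point_crossover(parent1, parent2, rng):
    """Pick an independent cut point in each parent, then exchange tails."""
    cut1 = rng.randint(1, len(parent1) - 1)
    cut2 = rng.randint(1, len(parent2) - 1)
    return crossover_at(parent1, parent2, cut1, cut2)


# Cut after position three of parent 1 and position two of parent 2.
p1 = ["CH2", "O", "CF2", "CHCl", "CH2"]
p2 = ["CH2", "CCl2", "O"]
c1, c2 = crossover_at(p1, p2, cut1=3, cut2=2)
print(c1, c2)
```

Here parents of lengths five and three produce two offspring of length four, so neither offspring has the length of either parent, while the total number of genes is conserved.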

[Figure 11 panels: Parent 1, Parent 2, Offspring #1, Offspring #2]

Figure 11: Operator for single-point crossover

Main-chain Mutation and Side-chain Mutation

These operators are analogous to the standard bit mutations. Main-chain and side-chain mutations involve the replacement of a randomly selected main- or side-chain group, respectively, by another chemically feasible group. The mutation operators conserve chemical consistency, i.e. the valency considerations of each atom are properly satisfied after each operation. For instance, when a group on the main chain is mutated to another group, the side-chain groups are correspondingly retained or removed according to whether the valency of the new group is equal to or less than that of the group that was mutated. Fig. 12 illustrates the main-chain and side-chain mutation operators.
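The valency bookkeeping of the mutation operators can be illustrated with a small sketch. Each gene is modeled here as a (backbone group, side-chain tuple) pair; the `SIDE_SLOTS` table of free side-chain positions is a hypothetical simplification, and padding with hydrogen when the new group has more free positions than the old one is an assumption of this sketch rather than something stated in the text.

```python
# Hypothetical table: number of side-chain positions offered by each backbone group.
SIDE_SLOTS = {"C": 2, "O": 0}


def mutate_main_chain(chromosome, index, new_group):
    """Replace one backbone group, keeping only as many side groups as it can carry."""
    _, sides = chromosome[index]
    slots = SIDE_SLOTS[new_group]
    kept = tuple(sides[:slots])
    kept += ("H",) * (slots - len(kept))   # assumption: pad extra positions with H
    child = list(chromosome)
    child[index] = (new_group, kept)
    return child


def mutate_side_chain(chromosome, index, slot, new_side):
    """Replace a single side-chain group on one backbone unit."""
    group, sides = chromosome[index]
    if not 0 <= slot < len(sides):
        raise IndexError("backbone unit has no such side-chain position")
    child = list(chromosome)
    child[index] = (group, sides[:slot] + (new_side,) + sides[slot + 1:])
    return child


parent = [("C", ("H", "Cl")), ("C", ("H", "H"))]
print(mutate_main_chain(parent, 0, "O"))       # side groups removed with the carbon
print(mutate_side_chain(parent, 1, 0, "F"))    # one H replaced by F
```

In a full implementation the index, the slot and the replacement group would be drawn at random from the palette of chemically feasible groups.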

[Figure 12 panels: Main-chain mutation (parent and offspring); Side-chain mutation (parent and offspring)]

Figure 12: Main- and side-chain mutation operators

Insertion and Deletion

The insertion operator randomly inserts a group at a single main-chain or side-chain location. Similarly, the deletion operator randomly removes a small number of main-chain or side-chain groups. Removal of a side-chain group is equivalent to replacing the group with hydrogen. Insertion and deletion operators always lead to a modification in the number of genes of the chromosome being operated on. Examples of these operators are shown in Figs. 13 and 14.
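Insertion and deletion can be sketched on the same kind of gene list. In this minimal sketch each gene is a (backbone group, side-chain tuple) pair and the positions are passed in explicitly rather than drawn at random; the function names are hypothetical.

```python
def insert_group(chromosome, position, gene):
    """Insert one backbone gene before the given position."""
    return chromosome[:position] + [gene] + chromosome[position:]


def delete_main_chain(chromosome, position, count=1):
    """Remove a small number of consecutive backbone genes."""
    child = chromosome[:position] + chromosome[position + count:]
    if not child:
        raise ValueError("deletion would remove the whole repeat unit")
    return child


def delete_side_chain(chromosome, index, slot):
    """Removing a side-chain group is equivalent to replacing it with hydrogen."""
    group, sides = chromosome[index]
    new_sides = sides[:slot] + ("H",) + sides[slot + 1:]
    return chromosome[:index] + [(group, new_sides)] + chromosome[index + 1:]


parent = [("C", ("H", "Cl")), ("O", ()), ("C", ("H", "H"))]
longer = insert_group(parent, 1, ("C", ("H", "H")))
shorter = delete_main_chain(parent, 1)
dechlorinated = delete_side_chain(parent, 0, 1)
print(len(parent), len(longer), len(shorter))
```

All three functions return new lists, so the parent chromosome is left unchanged and can still be carried over by elitism.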

Figure 13: The insertion operator

Figure 14: The deletion operator

[Figure 15 panels: Parent 1, Parent 2, Offspring (blending)]

Figure 15: The blending operator

The Blending Operator

The blending operator produces one offspring from the end-to-end connection of two parents. This essentially combines the attributes from both parents. Figure 15 shows the blending of two parent chromosomes. The blending operator radically increases the molecular length.

The Hop-Mutation Operator

When this operator is applied, a randomly selected gene of the molecule exchanges position with another randomly selected gene. Thus, the selected genes 'hop' into the positions occupied by each other. An example of the process is illustrated in Fig. 16. This facilitates small rearrangements in the ordering of the units in a molecule, thus causing a local search for the appropriate isomeric form that increases the fitness. The operation is equivalent to the mutation of two genes of the chromosome to two pre-decided values. Hence the operator is known as hop-mutation.
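Both the blending and the hop-mutation operators reduce to simple list operations on the symbolic chromosome. The sketch below uses flat lists of backbone groups and explicit positions; the function names and example chromosomes are hypothetical.

```python
def blend(parent1, parent2):
    """End-to-end connection of two parents into a single, longer offspring."""
    return list(parent1) + list(parent2)


def hop_mutation(chromosome, i, j):
    """Swap the genes at positions i and j, leaving the composition unchanged."""
    child = list(chromosome)
    child[i], child[j] = child[j], child[i]
    return child


p1 = ["CH2", "CH2", "O"]
p2 = ["O", "CF2", "CH2", "CO"]
offspring = blend(p1, p2)
isomer = hop_mutation(offspring, 0, 3)
print(offspring)
print(isomer)
```

Blending changes the length of the chromosome, whereas hop-mutation preserves both the length and the set of groups and only changes their order.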

5.5 CASE STUDIES: RESULTS AND DISCUSSION

In this section, two short examples of the polymer design problem, taken from work done by Venkatasubramanian and co-workers [52], are presented. The first case study is based on design cases that had been investigated by Joback and Stephanopoulos [56] using their heuristic-guided enumeration approach. The performance of the genetic search framework is demonstrated for polymers considered by Joback and Stephanopoulos in their study. The problem was to design polymers that satisfy the following property constraints:

Figure 16: The hop-mutation operator

Glass Transition Temperature: Tg > 400 K

Volume Resistivity: R > 1 x 10^16 ohm-cm

Thermal Conductivity: L > 1.6 x 10^-7 W/(m K)

Permeability to Oxygen: P(O2) < 1.0 cc-mil/100 in^2/day/atm

Note that the property constraints had only one bound, lower or upper, but not both. Such constraints are easier to design for than those with both bounds and tighter tolerances. The latter situation is discussed in the second case study. Given the open-ended nature of the constraints, the sigmoidal fitness function was chosen. The polymer groups considered for the search are the same as Joback's and are listed in Table 1.

Appropriate values for the genetic algorithm parameters such as the population size, operator probabilities, etc. are important for an efficient search. The various parameter values used in the case studies are shown in Table 2. Polymer molecules of length 2 to 10 groups were considered. A population of 100 members was used. Steady state reproduction was employed, whereby the population size remained fixed at all times. An elitist policy was used in which the ten fittest members of the population from the parent generation are passed unchanged to the next. These parameter values were chosen by Venkatasubramanian and co-workers after limited experimentation. It should be noted that these might not be the optimal parameter values for the problem. A number of parameters can have a major impact on the design outcome and, in fact, a sub-optimal set of parameters can possibly lead to failure in discovering the target solution. Parametric sensitivity and robustness analyses for the polymer design problem are briefly discussed in the longer case study presented in chapter 13.

Table 1. Palette of groups for the first case study

-CH2-          -C(CH3)2-        -C(CH3)(C6H5)-
-O-            -O-C(=O)-O-      -C(=O)-NH-
-CH(C6H5)-     -CH2-...-CH2-    -O-C(=O)-
-C(=O)-        -CF2-            -CHCl-
-CCl2-

Table 2. GA parameters

Parameter                                             Value
Steady state population                               100
Gaussian fitness decay rate (α)                       0.001
Sigmoid slope parameter (β)                           10
Maximum polymer length                                10
Elitist retention with respect to population size     10%

Genetic operator probabilities:
Crossover                                             0.2
Backbone mutation                                     0.2
Sidechain mutation                                    0.2
Hop                                                   0.2
Deletion                                              0.1
Blending                                              0.1
Insertion                                             0.0
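The operator probabilities of Table 2 sum to one, so a single operator can be drawn per reproduction event. The following minimal sketch shows one way to do this; the dictionary keys and the `pick_operator` helper are hypothetical names, and drawing exactly one operator per offspring is an assumption about how the probabilities are applied.

```python
import random

# Operator probabilities as listed in Table 2.
OPERATOR_PROBABILITIES = {
    "crossover": 0.2,
    "backbone_mutation": 0.2,
    "sidechain_mutation": 0.2,
    "hop": 0.2,
    "deletion": 0.1,
    "blending": 0.1,
    "insertion": 0.0,
}


def pick_operator(rng):
    """Draw one operator name with probability proportional to its table entry."""
    names = list(OPERATOR_PROBABILITIES)
    weights = [OPERATOR_PROBABILITIES[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(42)
draws = [pick_operator(rng) for _ in range(1000)]
print({name: draws.count(name) for name in OPERATOR_PROBABILITIES})
```

With a probability of 0.0 the insertion operator is never selected under these settings, so chromosome growth comes only from crossover and blending.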

Joback reports that there are about 18,000 feasible molecules for this set of constraints and he lists fifty of them. The results of the genetic algorithm are summarized in Table 3. Each independent run of the GA consisted of evolution up to a maximum of 100 generations, with a steady state population of 100 molecules. The table shows the number of distinct polymers found as well as the total number of molecules. The total number typically includes several copies of the same polymer. One can see that each run was successful and hundreds of solutions were identified. The first solution was often found within the first 5-10 generations. When the runs were allowed to evolve for more generations (say, 500 or so), many more solutions were found. As mentioned before, this is a relatively easy design problem since the constraints are open-ended and not tight.

Table 3. Results for case study 1. Initial population size = 100. Total generations = 100

Run #      No. Distinct Solutions Found    Total No. Solutions Found
1          1042                            5278
2          1063                            5274
3          1058                            5204
4          1099                            5434
5          1058                            5161
6          1083                            5530
7          1040                            5215
8          999                             5381
9          1032                            5124
10         1049                            5118
Average    1052.30                         5271.90

In the second case study, certain features were introduced to make the design problem more complex than the first case study. First of all, independent changes of the groups in the side chain as well as in the backbone of the polymers were allowed. In the first case study, the side-chain groups could not be changed independently. Next, the constraints were tightened so that the design problem was to identify a molecule whose property values were within ±0.5% of the target properties. It is important to note here that this tolerance was very tight and made the search more difficult. Previous efforts in molecular design had not considered such tight constraints. Lastly, the number of constraints was increased from four to five. The property constraints that were considered in this case study were density, glass transition temperature, linear thermal expansion coefficient, dielectric constant, and specific heat capacity. The properties were calculated by using van Krevelen's group contribution methods.

[Figure 17 panels: Mainchain groups; Sidechain groups (-H, -CH3, -F, -Cl)]

Figure 17: Base groups used in the second case study

The main-chain and side-chain base groups chosen for this case study are given in Fig. 17. These groups were chosen such that group contribution parameters were available for all the properties considered and that the molecules constructed by the genetic operators satisfied normal chemical bonding constraints. Feasibility constraints were programmed into the genetic algorithm in order to avoid chemically infeasible group combinations. This is another illustration of the ability of the GA-based approach to allow easy incorporation of complex chemical interactions or arrangement constraints. Three target polymers were selected that offered different levels of difficulty in design:

- Polyethylene terephthalate (PET),
- Poly(vinylidene propylene) copolymer (PVP),
- Polycarbonate of bisphenol-A (PC)

Polyethylene terephthalate is the simplest, and the polycarbonate is the most difficult of the three. This is so because PC has nonlinear group interactions, where the ordering of the groups matters in determining the properties, and hence the search space is more complex. The properties of these target molecules, computed using group contribution, are listed in Table 4. These were submitted, one molecule at a time, to the genetic design system as the target properties with a tolerance of ±0.5% in the property values.

Table 4: Target polymers and their properties

Target polymer                          Density     Glass transition     Thermal expansion      Specific heat         Dielectric
                                        ρ, g/cm3    temperature Tg, K    coefficient α, K^-1    capacity Cp, J/kg·K   constant ε
Polyethylene terephthalate              1.342       340                  2.96 x 10^-4           1153                  3.44
Poly(vinylidene propylene) copolymer    1.175       249                  2.77 x 10^-4           1378                  2.14
Polycarbonate of bisphenol-A            1.184       437                  2.85 x 10^-4           1134                  3.00
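The ±0.5% tolerance translates each target property into a narrow acceptance window. The sketch below computes these windows for the polyethylene terephthalate targets of Table 4 and checks a candidate against them; the dictionary keys and the helper names `bounds` and `meets_all` are hypothetical.

```python
# Target properties of polyethylene terephthalate as listed in Table 4.
PET_TARGETS = {
    "density_g_cm3": 1.342,
    "tg_K": 340.0,
    "thermal_expansion_per_K": 2.96e-4,
    "cp_J_per_kg_K": 1153.0,
    "dielectric_constant": 3.44,
}


def bounds(target, tolerance=0.005):
    """Acceptance window of +/- tolerance (as a fraction) around the target value."""
    delta = tolerance * abs(target)
    return target - delta, target + delta


def meets_all(candidate, targets, tolerance=0.005):
    """True only if every property lies inside its acceptance window."""
    for name, target in targets.items():
        low, high = bounds(target, tolerance)
        if not low <= candidate[name] <= high:
            return False
    return True


low_tg, high_tg = bounds(PET_TARGETS["tg_K"])
print(f"Tg window: {low_tg:.1f} K to {high_tg:.1f} K")

near_miss = dict(PET_TARGETS, tg_K=342.0)
print(meets_all(PET_TARGETS, PET_TARGETS), meets_all(near_miss, PET_TARGETS))
```

For a glass transition temperature of 340 K the window is only 338.3 K to 341.7 K, so a candidate that is off by 2 K in a single property is already rejected even if all other properties match.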

Tables 5 and 6 summarize the performance of the algorithm, averaged over fifty runs, for two different design scenarios. In the first, the program was asked to design monomers that varied in length from 2 to 7 units on the backbone, even though the target polymer's length was less than 7 (Tables 5a and 5b). This made the search more difficult as there were more possibilities with increased length. In the second test case, the permitted monomer length was from 2 to 10 units (Tables 6a and 6b). In each case, two different initializations of the starting population were considered: (i) random monomer lengths with random backbone and sidechain groups selected from Fig. 17 (Tables 5a and 6a) and (ii) random carbon backbones of varying lengths with H sidechains (Tables 5b and 6b). The stochastic nature of the genetic algorithm required the results to be averaged over several runs for each case in order to be statistically meaningful. Each run terminated at the 200th generation. All the runs employed the identical set of parameters given in Table 2. The Gaussian fitness function was used this time, since such a fitness function is more appropriate for bounded constraints.

Table 5a: Results for random groups in the backbone and side-chain. Monomer length = 2-7

Target Polymer                          Avg. generation # when      Avg. # of solutions at      Percentage of
                                        first solution was found    the end of GA search        runs successful
Polyethylene terephthalate              28.2                        10.5                        100%
Poly(vinylidene propylene) copolymer    11.3                        14.0                        100%
Polycarbonate of bisphenol-A            41.0                        3.9                         100%

Table 5b: Results for random -CH2- groups

Target Polymer                          Avg. generation # when      Avg. # of solutions at      Percentage of
                                        first solution was found    the end of GA search        runs successful
Polyethylene terephthalate              13.6                        11.3                        100%
Poly(vinylidene propylene) copolymer    11.3                        14.3                        100%
Polycarbonate of bisphenol-A            58.0                        3.8                         100%

The results for the success rate indicate that the genetic search did very well in general. For the polymers PET and PVP, the search discovered these polymers in every run. Furthermore, it discovered multiple instances of these polymers with exactly the same structure and also found them fairly quickly as seen from the low average generation count. In addition, it also found several other structures, which had very high fitness values (typically, 0.90 or better). It took longer to find the solution for L=10 in comparison with L=7, as the search space was larger for the former. It is interesting to note that for L=10, the genetic search discovered dimers as well as monomers.

With respect to computational effort, the longest run (for polycarbonate in Table 6b) took about 5 minutes in real-time (about 2 cpu secs) on a Sparc 10 workstation.

Table 6a: Results for random groups in the backbone and side-chain. Monomer length = 2-10

Target Polymer                          Avg. generation # when      Avg. # of monomers found      Avg. # of dimers found        Percentage of
                                        first solution was found    at the end of the GA search   at the end of the GA search   runs successful
Polyethylene terephthalate              28.4                        9.1                           7.8                           100%
Poly(vinylidene propylene) copolymer    12.1                        6.7                           14.8                          100%
Polycarbonate of bisphenol-A            60.6                        2.9                           5.8                           100%

Table 6b: Results for random -CH2- groups

Target Polymer                          Avg. generation # when      Avg. # of monomers found      Avg. # of dimers found        Percentage of
                                        first solution was found    at the end of the GA search   at the end of the GA search   runs successful
Polyethylene terephthalate              14.7                        8.5                           8.1                           100%
Poly(vinylidene propylene) copolymer    12.4                        6.9                           13.9                          100%
Polycarbonate of bisphenol-A            73.1                        1.7                           3.5                           76%

The polycarbonate was the most difficult structure to identify, as mentioned earlier. Consequently, it took more generations on average to discover this polymer. However, the genetic search did discover this polymer as well, with a 100% success rate for the L=7 case. For the L=10 case, it was less successful (76%) when the initial population was a random collection of -CH2- chains. This was so because the members of the initial population were very different in their structure from the target, and hence it took longer to discover the correct groups and structure. It was observed that if the evolution was allowed to continue for 300 generations in the failed runs, the genetic search was able to discover the target in most cases. In the case of random groups initialization, some of the right groups (like benzene or OCO) were already present in the initial population. This gave a better start and hence a quicker search.

5.6 CONCLUSIONS

This chapter has illustrated the use of genetic algorithms or genetic programming for computer-aided molecular design. A background of GAs, their theory and implementation has been provided. Though the two test problems discussed are relatively small, they are sufficient to present a flavor of the utility of a genetic search method for CAMD.

As clearly demonstrated by the case studies, the genetic algorithm framework offers a number of advantages. First of all, it is a multiple-point search technique that examines a set of solutions and not just one solution; this and the stochastic nature of the algorithm help the search to escape local minima. Secondly, it is not derivative-based and is therefore able to avoid the difficulties faced by math programming techniques in that respect. Furthermore, the framework allows relatively easy expression of the rich and complex chemistry of molecules, thus allowing easy integration of whatever heuristic knowledge one might have about the problem into the genetic framework to speed up the design process. This is illustrated in the larger polymer design case study discussed in chapter 13.

One can appreciate the significant advantage of having a multi-point search in that, regardless of whether the true target solution is located, a number of near-optimal solutions are presented to the designer. This becomes particularly significant for the design of molecules that are too complex for the forward predictions to be completely reliable. In such cases, one would like a range of design candidates that could be subjected to further testing with actual synthesis or experimentation in a laboratory.

A GA search strategy, no doubt, also suffers from some drawbacks. Mainly, the heuristic nature of the search means that there is no guarantee of finding the target solution. Secondly, the selection of good parameter values for a given problem requires some degree of experimentation. But then, these shortcomings are true of other heuristic approaches as well. And for a general nonlinear optimization search problem, the target, i.e. the global optimum solution, cannot be guaranteed in any case. Notwithstanding these drawbacks, the advantages of using a GA-based inverse strategy more than warrant its use as a design system. The appendix presents a bigger, more complex version of the polymer design problem wherein the merits of the algorithm become even more apparent. The study also briefly addresses issues related to parametric sensitivity and the robustness of the GA, which are of vital importance as far as the practical utility and application of the design system are concerned.

5.7 NOMENCLATURE AND ABBREVIATIONS

N          population size
F          fitness
cumf       cumulative fitness
sf         scaled fitness
P(·)       probability
E(·)       expected value
δ          defining length of a given schema
o          order of a given schema
L          (maximum) length of chromosome
f_ξ(t)     average observed fitness of schema ξ at time t
f̄(t)       average fitness of the population at time t
N_ξ(t)     number of members in schema ξ at time t
Pc         probability of crossover
Pm         probability of bit-mutation
O          state or phenotype space
W          genetic or genotype space
α          decay rate for Gaussian fitness function
           slope parameter for sigmoidal fitness function

CAMD       Computer-Aided Molecular Design
GA(s)      Genetic Algorithm(s)
PET        Polyethylene terephthalate
PVP        Poly(vinylidene propylene) copolymer
PC         Polycarbonate of bisphenol-A


Computer Aided Molecular Design: Theory and Practice. L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors). © 2003 Elsevier Science B.V. All rights reserved.

Chapter 6: A Hybrid CAMD Method

P. M. Harper, M. Hostrup & R. Gani

6.1 INTRODUCTION

As in any design problem, the design process in CAMD also needs to generate and evaluate alternatives in order to find the desired chemical product. In the case of CAMD, the alternatives are chemically feasible molecules (or mixtures of molecules) and the feasible candidate molecules (or mixtures) are those that satisfy the design specifications represented by a set of property constraints. This chapter describes a framework for a hybrid CAMD method. The design process, according to this framework, is divided into three phases:

• The pre-design phase - definition phase of the CAMD problem.

• The design phase - solution phase of the CAMD problem in terms of generation of feasible candidates.

• The post-design phase - analysis phase of the CAMD problem, where the final selection is made.

Figure 1 illustrates the principal ideas behind this framework through a simple CAMD problem where functional groups are used as the building blocks for generating feasible molecular structures.

[Figure 1 (panel text). Pre-design: the request "I want acyclic alcohols, ketones, aldehydes and ethers with solvent properties similar to benzene" is interpreted into input/constraints: a set of building blocks (CH3, CH2, CH, C, OH, CH3CO, CH2CO, CHO, CH3O, CH2O, CH-O) plus a set of numerical constraints. Design (start): a collection of group vectors, such as 3 CH3, 1 CH2, 1 CH, 1 CH2O, all of which satisfy the constraints. Design (higher levels): molecular structures with second-order groups and groups from other GCA methods identified; refined property estimation, with the ability to estimate additional properties or use alternative methods; rescreening against constraints. Start of post-design: the remaining molecular structures.]

Figure 1: Illustration of the CAMD framework


The application of the framework illustrated in Figure 1 requires a number of methods and tools that need to be integrated in order to provide a flexible, reliable and robust solution to a large range of CAMD problems. Figure 2 highlights the architecture of such a hybrid CAMD method. In this chapter, the term "product" will be used to mean molecules as well as mixtures.

[Figure 2 (box labels). Pre-design phase: Problem Specification; Method & Constraint Selection. Design phase: Compound Identification, by a database approach or by a level-based approach (Level 3 and Level 4 are among the labels). Post-design phase: Result Analysis and Verification; Candidate Selection. Supporting tools: Databases; Molecular Modelling; External Tools.]

Figure 2: The hybrid CAMD method and framework for integration

6.2 PRE-DESIGN PHASE

The CAMD formulation in terms of design specifications is performed in a pre-design phase where the CAMD problem is described in terms of identified design goals, desired molecule type(s) and properties. As shown in Figure 2, this pre-design phase consists of a problem specification step and a method & constraint selection step, which includes an algorithm for problem formulation.

6.2.1 CAMD Problem Specification

The design process starts with a definition of the basic needs (or ultimate goals). The type of goal may influence many of the design decisions that will need to be made during the later phases of the CAMD problem solution. The goal should describe the function of the desired chemical product, the environment/equipment where the function should be performed, as well as the capabilities that are desirable/undesirable. For example, in the case of design of solvents, the desired solvent must dissolve a specified solute(s), it must be selective if other soluble solutes are also present, it must not cause a negative environmental impact and it should be easy to recover. The description of the goals of CAMD can be of different types; a few examples are given below.

"Find a solvent suitable for removing phenol from a waste water stream by liquid-liquid extraction. The solvent should pose a low health risk for the users, should be environmentally friendly and could be a single molecule or a mixture." This is an example of a well-defined problem, as almost all necessary details are given. From the specified details, the properties that are needed (such as solubility, EH&S properties, liquid immiscibility, etc.) can be identified. The goal values for the properties are not given, but if the objective is to find the best solvent, then it must have the highest solubility and the least environmental impact.

"Identify a molecule(s) with the same pure component properties as benzene, such as normal boiling point, normal melting point, octanol-water partition coefficient, solubility parameter as that of benzene but with a much lower environmental impact in the work place." Again, this is a well-defined problem, with even the goal values given because the property values of benzene are already known.

"Find a solvent to be used for washing off an equipment (for example a printing press) which is environmentally friendly and cheap." Here, the problem is not very well defined because, while some of the constraints are defined, one piece of important information is missing: what should be dissolved by the solvent? Furthermore, the definition of 'cheap' depends on the process involved as well as the current solution used.

"Find an additive (molecule or mixture) for a tape so that the tape will stick to a painted surface for a year and then can be removed without pulling off the paint." This is another example of a not very well-defined problem, because we need more information on the glue that will be added to the tape as well as the various compositions of paints where the tape will stick. The main question here is: which properties are we looking for and what are their goal values?

"Find a molecule that will have inhibition activity against Alzheimer's disease." Problems of this type, although very well defined in terms of property, are difficult to solve because of the size of the potential search space. If, however, we add the following: "search only among the isomers of XX", where "XX" is a particular molecular type, then we have a well-defined problem, even though the number of possible isomers may be quite large and prediction of the inhibition activity as a function of the molecular structure may be quite difficult.

"Find all molecules that form an azeotrope with ethanol at a pressure of 1 atm." This is not a typical product design problem. CAMD, however, can also solve problems like this. It is not well defined because the search space is potentially very large. However, if we select a molecule type (for example, acyclic hydrocarbons of molecular weight less than 100), then the problem becomes well defined.

The above examples highlight the need for a knowledge-based system that can identify the needed properties from the general problem specifications presented above. Once the properties have been identified, their goal values need to be specified and methods for obtaining the necessary property values need to be selected. That is, the qualitative problem specification needs to be transformed into a quantitative problem specification.

6.2.2 Method & Constraint Selection

The objective of this step is to transform the qualitative problem specification from the previous step into a quantitative form that is suitable for CAMD problem solution during the design phase. The quantitative problem specification consists of the following:

• Identify the needed properties - this matches the qualitative specification with behavior (properties) of the chemical product.

• Identify the goal values of the needed properties - this matches the actual goal of the product with respect to its function and behavior.

• Identify the methods for obtaining the property values - this determines how the property (behavior) of the product will be obtained.

• Identify the building blocks for generation of molecular structures or candidate chemicals for mixture design - this determines the search space and the scale of the molecular structural model.
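The four items above can be pictured as one small data structure. The following is only an illustrative sketch: the class names, the property names "Tb" and "Tm" and all numerical goal values are assumptions made for the example and are not taken from the chapter.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyConstraint:
    """One needed property, its goal values and the method used to obtain it."""
    name: str                      # hypothetical label, e.g. "Tb" for boiling point
    lower: float = float("-inf")   # lower goal value
    upper: float = float("inf")    # upper goal value
    method: str = "unspecified"    # how the property value is obtained

    def is_satisfied(self, value):
        return self.lower <= value <= self.upper


@dataclass
class QuantitativeSpec:
    """Building blocks plus property constraints (illustrative container only)."""
    building_blocks: list
    constraints: list = field(default_factory=list)

    def violated(self, values):
        """Return the names of the constraints that the given values fail."""
        return [c.name for c in self.constraints
                if not c.is_satisfied(values[c.name])]


# Hypothetical specification; the numbers are placeholders, not design data.
spec = QuantitativeSpec(
    building_blocks=["CH3", "CH2", "CH", "C", "OH"],
    constraints=[
        PropertyConstraint("Tb", lower=340.0, upper=370.0, method="group contribution"),
        PropertyConstraint("Tm", upper=280.0, method="group contribution"),
    ],
)
print(spec.violated({"Tb": 355.0, "Tm": 290.0}))  # ['Tm']
```

Keeping the estimation method next to each constraint mirrors the third item of the list: the same property may be obtained in different ways, and the choice is part of the problem specification.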

A knowledge base can be very useful in assisting the transformation of the qualitative problem specification into a quantitative one. A knowledge base that is particularly suitable for applications involving solvent-based separation processes is highlighted below.

Knowledge base

The objective of this knowledge base is to assist in the transformation of general qualitative solvent design problem specifications into quantitative ones that are suitable for CAMD problem solution.

The information contained in the knowledge base is ordered as a hierarchical system with the application types of the solvent-based process at the top and the properties and property values at intervals of specified conditions of temperature, pressure and/or composition at the bottom. Figure 3 illustrates a section of the information tree belonging to this knowledge base. It can be noted that the property entries in the information tree in Figure 3 have three branches:

Essential Properties: The properties in this branch are essential for the function of the desired product and are most often related either to the phase behavior of the molecule or to the driving forces for the separation operation the molecule is intended for. For example, the constraint that the molecule must be in the liquid state at the operational temperatures of the process creates an essential requirement that the boiling point of the molecule is above the operational temperature while the melting point is below it. Also, if the molecule is to be used as a solvent for liquid-liquid extraction, it must cause a phase split and have a density different from that of the solutes.

Figure 3: Partial information tree of the knowledge base

Desirable Properties: Desirable properties are related to the performance or efficiency of a product in a specified application. The product may still be acceptable if these properties are not matched. They become important during the selection of the feasible candidates and during performance evaluation in order to determine the optimal design. As a rule of thumb, fixed lower or upper limits cannot usually be set for these desirable properties. Generally, the aim is to have the highest or lowest possible value for the identified desirable properties. An example of a desirable property is the selectivity towards a specified solute that must be extracted from a mixture with other solutes through a solvent-based extraction process such as liquid-liquid extraction. For convenience, the undesirable properties are also included in this class of properties.

EH&S and Special Properties: These properties are associated with the performance of the product in a specific operation or function and its effect on the surroundings (or environment) as a result of its use and emission. These properties may be specified as essential, desirable and/or undesirable. However, they are placed in a separate class because methods for their direct estimation are usually not available. Consequently, they may be considered in the post-design phase through database search or even through direct or indirect experiments. In this way, this type of potentially expensive analysis is reserved only for those candidates that satisfy all other product criteria. Note that some of the essential and desired products may implicitly also satisfy the EH&S and special property constraints. Examples of special properties are those related to smell, color and taste.
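The different roles of the first two property classes can be sketched in a few lines: essential properties act as hard pass/fail constraints, whereas desirable properties only rank the candidates that pass. This is a minimal illustration under stated assumptions; the candidate names, temperatures and selectivity values are invented, and EH&S and special properties are left out because the text defers them to the post-design phase.

```python
def screen_and_rank(candidates, operating_temp, desirable_key):
    """Keep candidates that are liquid at operating_temp, then rank them.

    Essential requirement (from the text): melting point below and boiling
    point above the operating temperature. The desirable property has no
    fixed limit; the highest value is simply preferred.
    """
    feasible = [c for c in candidates if c["Tm"] < operating_temp < c["Tb"]]
    return sorted(feasible, key=lambda c: c[desirable_key], reverse=True)


# Hypothetical candidates; all values are placeholders for illustration.
candidates = [
    {"name": "A", "Tm": 250.0, "Tb": 360.0, "selectivity": 3.1},
    {"name": "B", "Tm": 310.0, "Tb": 420.0, "selectivity": 9.0},  # solid at 300 K
    {"name": "C", "Tm": 200.0, "Tb": 340.0, "selectivity": 5.4},
]
ranked = screen_and_rank(candidates, operating_temp=300.0, desirable_key="selectivity")
print([c["name"] for c in ranked])  # ['C', 'A']
```

Candidate B has the best selectivity but is rejected anyway, which is the practical meaning of "essential": no value of a desirable property can compensate for a violated essential one.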

Each property branch is divided into a pure properties leaf and a mixture properties leaf. The pure properties are further divided into primary properties, secondary properties and functional properties (this is not shown in Figure 3), while mixture properties belong to the class of functional properties (see Chapter 1). Note that some mixture (functional) properties such as solubility may be calculated as a function of primary properties, while some other functional properties and secondary properties may be calculated as a primary property. For example, if a rigorous model for estimation of solute solubility is not available, the necessary property values may be estimated through solubility parameters. However, although the solubility parameter is by definition both a functional property (temperature dependence) and a secondary property (function of molar volume and heat of vaporization), it becomes a primary property if the temperature is fixed at 298 K and if it is directly correlated as a function of molecular structural parameters. The knowledge base contains this information and is useful when a needed model of one type is not available.
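To make the secondary-property route concrete, the sketch below evaluates the solubility parameter from heat of vaporization and molar volume at a fixed 298 K. It assumes the common Hildebrand definition, delta = sqrt((Hvap - R T) / Vm), which the chapter does not state explicitly; the benzene inputs are approximate literature values used only for illustration.

```python
import math

R = 8.314  # J/(mol K), universal gas constant


def hildebrand_solubility_parameter(h_vap, molar_volume, temperature=298.0):
    """Return the solubility parameter in MPa^0.5.

    h_vap is the heat of vaporization in J/mol and molar_volume is in m^3/mol.
    Assumes the Hildebrand definition (an assumption, not quoted from the text).
    """
    cohesive_energy_density = (h_vap - R * temperature) / molar_volume  # Pa
    return math.sqrt(cohesive_energy_density) / 1000.0  # Pa^0.5 -> MPa^0.5


# Approximate values for benzene at 298 K (assumed for the example).
delta = hildebrand_solubility_parameter(h_vap=33.9e3, molar_volume=89.4e-6)
print(round(delta, 1))  # about 18.7
```

With the temperature fixed, the result depends only on quantities that can themselves be estimated from molecular structure, which is why the property can then be treated like a primary one.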

In the case of functional properties, the CAMD problem specification needs to specify the range of conditions where these properties must be matched, that is, the intervals of conditions of operation as a function of temperature, pressure and/or composition.

In addition to the information contained in the partial information tree of Figure 3, the knowledge base may also include tabular data linking a particular CAMD problem type with corresponding properties, linking properties to EH&S analyses as well as data related to the CAMD problem type and the phenomena involved. Three examples of such tabular data are given through Tables 1, 2 & 3.

Table 1: List of separation techniques and their corresponding separation phenomena (defined by class and phases involved)

Separation technique         Class                 Phases involved

Crystallization              Property difference   Solid-Liquid
Distillation                 Property difference   Vapor-Liquid
Distillation plus decanter   Property difference   Vapor-Liquid, Liquid-Liquid
Extractive distillation      Solvent-based         Vapor-Liquid
Azeotropic distillation      Solvent-based         Vapor-Liquid-Liquid
Liquid-Liquid extraction     Solvent-based         Liquid-Liquid
Super-critical extraction    Solvent-based         Fluid-Vapor-Liquid

In the knowledge base (Table 1) the properties important to the function in a particular application are listed along with the relative property differences needed to perform the function (column 2 of Table 1) and the associated phases involved in the particular application (column 3 of Table 1). In the knowledge base for essential and desirable properties as a function of application type (Table 2), the listed properties should only be used as a starting point. Other properties may need to be added and some of the listed properties may need to be removed depending on the particular CAMD problem specifications. The EH&S properties listed in Table 3 are given as general guidelines based on the phases involved in the applications listed in Table 2. Note that the consideration of EH&S properties is often dependent on the entire process (how the solvent-product is handled and the possible routes of discharge to the environment). Nevertheless, the consideration of EH&S related properties on a unit operation level can address work place health and safety issues associated with non-routine releases as well as make it possible to use more rigorous approaches to environmental impact minimization (see also section 6.4.2).

Table 2: List of important properties for some separation techniques

[Table 2 layout. Columns (solvent design applications): L-L Extraction, Extractive Distillation, Azeotropic Distillation, Solid Separation and Gas Absorption, each with an E and a D sub-column. Rows: pure properties (including Tm and Hvap) and mixture properties (including Selectivity, SL, SP, DC, Phase-split and Azeotrope). The individual E/D entries are not legible in this copy.]

Note: E is Essential; D is Desirable; L-L is liquid-liquid; the definitions of property variables in column 1 are given in Nomenclature.

Table 3: List of properties for addressing EH&S considerations

[Table 3 layout. Columns (environmental concern): Health, Safety, Environment. Rows, implicit properties: Toxicity; Biological persistence; Chemical stability; Reactivity. Rows, explicit properties: Biodegradability; Pv; H (in water); Log P; Log Ws; Flash point; BOD; p (vapor); Evaporation rate; LD50; ODP. The check marks linking properties to concerns are not legible in this copy.]

Problem Formulation Algorithm

The objective of the problem formulation algorithm is to transform the qualitative problem specification into a quantitative one through a combination of the use of knowledge base, insights and experience. It is a multi-step process requiring different levels of information. A step-by-step algorithm that may be useful for CAMD problem formulation is given below. The corresponding representation of the algorithm as a block diagram is shown in Figure 4.

• List the unit operations to be considered.

• For each unit operation:

   o Retrieve the known properties of the compounds the designed compound is to be used with.

   o Obtain the operational ranges of temperature and pressure along with the composition ranges for the compounds in the system.

   o Identify the property models available for estimation of the needed pure and mixture properties.

   o Extract the list of relevant pure and mixture properties from the knowledge base for the unit operation. If the selected property models from the previous step are unable to estimate the needed properties, consider either adding a new model or estimating a similar property that can be estimated reliably.

   o If any of the design properties require information about the other compounds in the system in order to set up the target values, compare the requirements with the list of known properties obtained from above. If some requirements cannot be fulfilled, the properties are removed from the set of design criteria.

• Create a superset of criteria by combining the sets of identified properties for each of the unit operations.

• For each of the properties in the superset, create the target ranges (the design constraints) by combining the property intervals identified for each of the unit operations and uses. The identified property intervals represent the design criteria satisfying the requirements of all the operations examined.

• List the methods available for predicting the required properties.

• List the molecule types that can be handled by the property prediction methods and the predictive thermodynamic models.

• From the list of compound types for which property prediction methods exist, create the list of building blocks used to create/assemble the molecules in the design phase.
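The superset and target-range steps of the algorithm can be sketched as follows. The sketch assumes that "combining the property intervals" means intersecting them, since the resulting criteria must satisfy all operations examined; the operation names, property names and numbers are hypothetical.

```python
def combine_criteria(per_operation):
    """Intersect property intervals collected for several unit operations.

    per_operation maps an operation name to {property: (lower, upper)}.
    Returns {property: (lower, upper)} over the superset of properties.
    """
    combined = {}
    for intervals in per_operation.values():
        for prop, (lo, hi) in intervals.items():
            old_lo, old_hi = combined.get(prop, (float("-inf"), float("inf")))
            combined[prop] = (max(old_lo, lo), min(old_hi, hi))
    # An empty intersection means no product can serve all operations.
    empty = [p for p, (lo, hi) in combined.items() if lo > hi]
    if empty:
        raise ValueError(f"no common interval for: {empty}")
    return combined


# Hypothetical intervals (K) for two unit operations.
per_operation = {
    "extraction": {"Tb": (350.0, 450.0), "Tm": (float("-inf"), 280.0)},
    "distillation": {"Tb": (330.0, 420.0)},
}
targets = combine_criteria(per_operation)
print(targets["Tb"])  # (350.0, 420.0)
```

Raising an error for an empty intersection is a design choice of this sketch; in practice the designer might instead relax one of the operation-specific requirements.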


Figure 4: Block diagram of the problem formulation algorithm

The result of the problem formulation algorithm is:

• A list of building blocks to use (e.g. CH3, CH2, CH, OH, COOH).

• A set of inequality constraints based on pure component properties.

• A set of inequality constraints based on mixture properties (along with information regarding the conditions at which to evaluate the properties).

• Information regarding the methods (pure and mixture) available for the evaluation of the constraints.

A database containing information on type of molecules versus building blocks (for example, groups) and type of molecules versus specific EH&S properties helps in the problem formulation. For example, functional groups (building blocks) such as "OH" and "COOH" must exist in alcohols and acids, respectively. Therefore, selection of molecular types such as alcohols and acids could be linked to automatic selection of "OH" and "COOH" functional groups in the set of building blocks. Similarly, aromatic compounds are likely to be carcinogenic while chlorides may cause corrosion and have a negative impact on environmental indicators. Therefore, choice of these EH&S properties as constraints means automatic exclusion of the corresponding compounds and, therefore, their corresponding building blocks. The first two steps in Fig. 1 also highlight this feature: for the specified type of desired molecules, the corresponding building blocks have been selected. A good exercise for the reader would be to consider the groups tables given in chapters 2 and 4 and prepare a table of molecular type versus groups (building blocks). An example of such a table is given below in Table 4 for simple mono-functional molecules.

Table 4: Molecule type versus groups

Molecule type            Groups (building blocks)

Acyclic hydrocarbons     CH3, CH2, CH, C
Aromatic hydrocarbons    CH3, CH2, CH, C, ACH, AC, ACCH3, ACCH2, ACCH
Alcohols                 CH3, CH2, CH, C, OH
Ketones                  CH3, CH2, CH, C, CH3CO, CH2CO
Esters                   CH3, CH2, CH, C, CH3COO, CH2COO, HCOO, COO
Acids                    CH3, CH2, CH, C, COOH
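A table of this kind lends itself to a simple lookup that turns the desired molecule types into a set of building blocks. The sketch below uses the entries of Table 4; the function name and the way excluded types are handled (a type ruled out by, for example, an EH&S constraint is simply skipped) are assumptions made for illustration.

```python
# Groups per molecule type, following Table 4.
GROUPS_BY_TYPE = {
    "acyclic hydrocarbons": ["CH3", "CH2", "CH", "C"],
    "aromatic hydrocarbons": ["CH3", "CH2", "CH", "C",
                              "ACH", "AC", "ACCH3", "ACCH2", "ACCH"],
    "alcohols": ["CH3", "CH2", "CH", "C", "OH"],
    "ketones": ["CH3", "CH2", "CH", "C", "CH3CO", "CH2CO"],
    "esters": ["CH3", "CH2", "CH", "C", "CH3COO", "CH2COO", "HCOO", "COO"],
    "acids": ["CH3", "CH2", "CH", "C", "COOH"],
}


def select_building_blocks(wanted_types, excluded_types=()):
    """Return the ordered union of groups for the wanted molecule types."""
    blocks = []
    for mol_type in wanted_types:
        if mol_type in excluded_types:
            continue  # type ruled out, e.g. by an EH&S constraint
        for group in GROUPS_BY_TYPE[mol_type]:
            if group not in blocks:
                blocks.append(group)
    return blocks


print(select_building_blocks(["alcohols", "ketones"]))
# ['CH3', 'CH2', 'CH', 'C', 'OH', 'CH3CO', 'CH2CO']
```

Excluding a molecule type removes only the groups that no remaining type needs, so the shared alkyl groups stay available.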

The information related to the quantitative CAMD problem specification is now passed to the next phase of the design process, that is, the design phase of the hybrid CAMD method.

6.3 DESIGN PHASE

Given the quantitative problem specification, the objective of the design phase is to apply a suitable method for generating the feasible candidates. Here, the feasible candidates can be a set of molecules (or mixtures) that satisfy all property constraints and/or the molecule (or mixture) that not only satisfies the constraints but also reflects the optimal performance. Whether it is a set of candidates or an optimal candidate (or a set of local optimal candidates) depends on the CAMD algorithm used in this design phase. In principle, any of the CAMD methods described in Chapters 2-5 & 7 can be used in this design phase.

The hybrid CAMD method described in this section employs successive generate & test approaches ordered in a hierarchy based on the level of molecular structural information used and the corresponding property estimation method. The properties are also ordered according to a hierarchy where the primary pure properties are estimated first, followed by secondary pure properties, followed by functional pure properties, and finally, the mixture properties. Note that the implicit EH&S properties and the implicit special properties are analyzed in the post-design phase in this hybrid CAMD method. In the CAMD solution approach of the generate & test type, all feasible molecules are generated from a set of building blocks and subsequently tested against the design specifications to screen out the alternatives that do not fulfill the requirements. The so-called combinatorial explosion problem associated with CAMD algorithms in general and generate & test approaches in particular is avoided through the employed multi-level approach. That is, through successive steps of generation and screening against the design criteria, the level of molecular detail is increased only on the feasible candidates and not on all possible combinations.

6.3.1 Hybrid Generate & Test CAMD Algorithm

The hybrid generate & test based CAMD algorithm has four levels. Each level has its own generate & test algorithm. Higher levels use additional molecular structural information compared to lower levels. The fundamental basis for the developed algorithm is the continuous refinement of the results obtained from each level.

The lower levels have a low computational complexity (i.e., it is possible to generate a large number of alternatives without excessive calculations) but do not in all cases generate all the information necessary to estimate the important properties. The higher levels are more complex and cannot handle a very large number of alternatives without significant computational effort. Consequently, the design strategy of the developed algorithm is a hybrid approach where the lower levels are used to "pick out" promising candidates from the search space while the higher levels use the output from the preceding level as input. The net effect of this approach is that the results are refined from level to level without spending computational resources on candidates that are unable to fulfill the requirements. In outline form the characteristics of the levels are:

• Level 1 generates group vectors by combining groups from a basic group-set (for example, the UNIFAC first-order groups; see the group sets also used in Chapters 2 and 4). Based on the equations and feasibility considerations given by Harper (2000), the algorithm generates all the feasible molecular representations without suffering from combinatorial explosion. The testing of the generated molecules against the design criteria is performed using methods based on the Group Contribution Approach (GCA).

• Level 2 takes the results from level 1, that is, the molecules surviving the test step of level 1, and combines the members of each group vector to form new molecules (including isomers).

• Level 3 brings the molecules out of the (pseudo) macroscopic group representation from level 2 into a microscopic (atom-based) representation by replacing the group information with the equivalent atomic information.

• Level 4 expands the microscopic information by adding a 3-dimensional representation to the results from level 3.

This multi-level procedure is illustrated through an example in Figure 5. Note that entry is possible at any level as long as the appropriate data is available.
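The successive generate & test idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Level`, `generate`, `passes` and `run_hierarchy` are hypothetical and are not part of the original algorithm. Each level is reduced to a function that refines one candidate into zero or more candidates with more detail, and a test that screens them, so that detail is only ever added to survivors of the preceding level.

```python
from typing import Callable, Iterable, List, NamedTuple


class Level(NamedTuple):
    """One level of the hierarchy (hypothetical structure for illustration)."""
    name: str
    generate: Callable[[object], Iterable[object]]  # refine one candidate
    passes: Callable[[object], bool]                # screen one candidate


def run_hierarchy(seeds: Iterable[object], levels: List[Level]) -> List[object]:
    """Refine candidates level by level, keeping only those that pass each test."""
    candidates = list(seeds)
    for level in levels:
        refined = []
        for candidate in candidates:
            for new in level.generate(candidate):
                if level.passes(new):  # screening is embedded in the generation
                    refined.append(new)
        candidates = refined           # only survivors reach the next level
    return candidates


# Toy demonstration: integers stand in for molecular representations.
toy_levels = [
    Level("level 1", lambda n: [n, n + 1], lambda n: n % 2 == 0),
    Level("level 2", lambda n: [n * 10, n * 10 + 5], lambda n: n < 50),
]
survivors = run_hierarchy([1, 2, 3], toy_levels)
```

In a real implementation the toy lambdas would be replaced by the generation and property screening steps of levels 1 to 4 described in the following sections.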


Figure 5: Illustration of the 4-level CAMD hybrid method

6.3.2 Level 1: Generation of group vectors from first-order groups

Level 1 generates vectors of groups (fragments) by combining groups from the first-order group-set. These sets are capable of forming at least 1 feasible molecular structure. Simultaneous calculation of related properties (that are dependent only on first-order groups) and screening of the generated structures are performed in order to control the problem size and execution time. The algorithm here is based on the group classification work of Gani et al. (1991) but uses a different and more efficient method of group vector assembly. The main features of the new algorithm are:

• Building blocks are classified according to type.
• Feasibility rules are based on the number of groups from a specific class a compound may contain.
• Valence rules are used to determine the number of groups with 1, 2, 3 & 4 connections to be used in molecule structure generation.

Generation algorithm for level 1

The main steps of the level 1 algorithm are illustrated through Figure 6. By using equation A.3 (see Appendix A of this chapter) repeatedly in conjunction with the classification system and the feasibility rules it is possible to generate only compounds (group vectors) fulfilling the feasibility requirements (i.e. no compounds are generated and subsequently discarded due to violation of the feasibility requirements). The algorithm for generation of feasible compound representations is:

1. Set C (the collection of designed compounds) equal to ∅ (the empty set).
2. Set Pc,v (the collections of compound sub-blocks from different classes and categories), where c = P, S, D, T, Q and v = 1, 2, 3, 4, 5, equal to ∅.
3. Give the list of building blocks (including the classifications).
4. Select the compound type (acyclic, cyclic or aromatic).
5. Give the maximum (Kmax) and minimum (Kmin) number of groups in a compound.
6. For all K (K = Kmin, ..., Kmax):
   a. Find all integer solutions (Vi,K; i = 1, ..., IK) to equations A.4 & A.5.
   b. For all solutions Vi,K; i = 1, ..., IK:
      i. Find all integer solutions (Gi,j; j = 1, ..., Ji) to equations A.6-A.11 as given in Appendix A.
      ii. For all solutions Gi,j; j = 1, ..., Ji:
         A. For each nc,v, where c = P, S, D, T, Q and v = 1, 2, 3, 4, 5, perform a lookup in Pc,v to see if results are present for the nc,v key. If not, find all possible combinations when selecting nc,v groups from the collection of available groups Nc,v, and store the combinations in Pc,v under the nc,v key.
         B. Find all combinations of the entries in P under the nc,v keys from Gi,j. Add each solution to Ci (the number of combinations can be calculated by equation A.12).
      iii. Screen Ci against the property constraints that can be handled in level 1 (see the next section for details), discarding any compound not fulfilling the requirements.
      iv. Add the surviving compounds from Ci to C.
7. Set K = K + 1.
8. If K ≤ Kmax go back to 6, else continue.
9. STOP
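The role of the valence rules in the steps above can be illustrated with a small Python sketch. It is not the algorithm of this section: equations A.3-A.12 are not reproduced here, so the sketch assumes the standard valence balance for an acyclic structure, sum of n_j (2 - v_j) = 2, where n_j is the number of groups of type j and v_j its number of free connections. It also enumerates candidate vectors by brute force and discards the infeasible ones, whereas the class-based algorithm above avoids generating them in the first place. The names `VALENCE`, `is_acyclic_feasible` and `generate_group_vectors`, and the `screen` callback, are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Number of free connections of a few first-order groups (assumed for illustration).
VALENCE = {"CH3": 1, "CHO": 1, "CH2": 2, "CH2COO": 2, "CH": 3, "C": 4}


def is_acyclic_feasible(vector):
    """Valence balance of a tree-like (acyclic) structure: sum n_j (2 - v_j) == 2."""
    return sum(n * (2 - VALENCE[g]) for g, n in vector.items()) == 2


def generate_group_vectors(groups, k_min, k_max, screen=lambda vector: True):
    """Return all group vectors with k_min..k_max groups that pass both tests."""
    collection = []
    for k in range(k_min, k_max + 1):            # Kmax is included
        for combo in combinations_with_replacement(sorted(groups), k):
            vector = Counter(combo)
            if is_acyclic_feasible(vector) and screen(vector):
                collection.append(dict(vector))
    return collection


# Example: esters with exactly one CH2COO group and 8 groups in total.
vectors = generate_group_vectors(
    VALENCE, 8, 8, screen=lambda v: v.get("CH2COO", 0) == 1
)
```

Both group vectors shown as results in Figure 6 (4 CH3, 1 CH2, 2 CH, 1 CH2COO and 3 CH3, 1 CHO, 1 CH2, 2 CH, 1 CH2COO) satisfy this balance and therefore appear in `vectors`.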

[Figure 6 flowchart, recoverable content. Inputs: unordered set of groups; group classification system; rules relating feasibility to classification; additional specifications (maximum number of groups, ring formation allowed or not). Step 1: determine how many groups with 1, 2, 3 & 4 connections are needed (example: total number of groups 8, of which 4 groups with 1 connection, 2 groups with 2 connections and 2 groups with 3 connections). Step 2: generate all possible combinations following the rules and specifications. Example results: 4 CH3, 1 CH2, 2 CH, 1 CH2COO; and 3 CH3, 1 CHO, 1 CH2, 2 CH, 1 CH2COO.]

Figure 6: Illustration of level-1 generation

After a successful run of this generation and screening algorithm, the net result is a collection C of vectors of groups describing a series of molecules all satisfying the property constraints that can be examined at level 1. Note two very important features of this method:

1. The screening is embedded into the generation algorithm. This is done in order to identify and remove undesirable candidates at an early stage and thereby conserve storage resources.

2. The created candidates can be represented as a vector of length 4 with each element pointing to a sub-vector in Pc,v. By using this approach information is not duplicated unnecessarily.
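The lookup-and-store behaviour of Pc,v can be sketched as a small cache. This is a minimal illustration, assuming that selecting nc,v groups from a class allows repetition (a molecule may contain several identical groups); the class name `SubBlockStore`, the method `lookup` and the class assignments in the example are hypothetical and not taken from the group classification system.

```python
from itertools import combinations_with_replacement


class SubBlockStore:
    """Cache of sub-blocks: combinations of n groups drawn from one (class, category)."""

    def __init__(self, groups_by_class):
        self.groups_by_class = groups_by_class  # {(c, v): [group names]}
        self.cache = {}                         # {((c, v), n): list of tuples}

    def lookup(self, key, n):
        """Return all selections of n groups from class `key`, computing them once."""
        if (key, n) not in self.cache:
            available = self.groups_by_class[key]
            self.cache[(key, n)] = list(combinations_with_replacement(available, n))
        return self.cache[(key, n)]


# Hypothetical classification, for illustration only.
store = SubBlockStore({("S", 1): ["CH3"], ("S", 2): ["OH", "CHO"]})
first = store.lookup(("S", 2), 2)
second = store.lookup(("S", 2), 2)  # served from the cache, not recomputed
```

Because candidates only hold references to entries of the store, the same sub-block list is shared by every compound that uses it.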

An example of the application of the level 1 algorithm is highlighted in Figure 6 while the block diagram of the algorithm is given in Figure 7.

[Figure 7 block diagram, recoverable labels: problem-specific conditions; obtain rule set (classification); solve equations to obtain the solutions; for each solution j and each class c = P, S, D, T, Q, find all combinations when selecting nc,v groups from Nc,v; combine the results to form compounds and screen against design criteria.]

Figure 7: Block diagram for algorithm of level-1 generation

Properties Handled in Level-1

The properties handled in level-1 are those estimated by group contribution methods based on the group-set (in particular the groups used in the methods of Constantinou and Gani (1994) or Marrero and Gani (2001)) as well as by correlations based on properties predicted by group contribution. Here the issue of property trust (defined in Chapter 1) comes into play. By using the results from property prediction methods in correlations for other (secondary) properties in order to further expand the property range (defined in Chapter 1) of the predictions, the property trust is diminished because of the risk of error propagation. At this level of the hybrid CAMD method it is not possible to improve the property trust by using experimental data as the input for the correlations. This is because the molecular structures are ambiguously defined and it is therefore not possible to perform lookups to external sources of data in a fast and easy way.
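A first-order group contribution estimate followed by a screening test can be sketched as follows. The contribution values, the constant and the property bounds below are hypothetical numbers chosen only to make the arithmetic visible; they are not taken from Constantinou and Gani (1994), Marrero and Gani (2001) or any other published table, and real methods generally apply a property-specific function to the sum of contributions. The function names are likewise assumptions.

```python
# Hypothetical contributions for an unnamed property (illustration only).
CONTRIBUTION = {"CH3": 20.0, "CH2": 25.0, "OH": 90.0}
CONSTANT = 200.0


def gc_estimate(vector, contributions=CONTRIBUTION, constant=CONSTANT):
    """Linear first-order estimate: constant + sum of n_i * c_i over the groups."""
    return constant + sum(n * contributions[g] for g, n in vector.items())


def passes_bounds(value, lower, upper):
    """Screening test against a property constraint lower <= value <= upper."""
    return lower <= value <= upper


candidate = {"CH3": 1, "CH2": 1, "OH": 1}
estimate = gc_estimate(candidate)              # 200 + 20 + 25 + 90
keep = passes_bounds(estimate, 300.0, 400.0)
```

Note that the estimate depends only on the group counts, which is why all isomers sharing a group vector receive the same value at this level.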

6.3.3 Level-2: Generation of Structural Isomers From Group Vectors

This level generates new molecular structures by combining elements of the individual fragment sets of the group vectors from level-1. First- and second-order groups (such as those defined by Constantinou and Gani, 1994) are considered in the calculation of properties in this level. The main feature of this algorithm is that it is pseudo-recursive. That is, all allowed combinations are considered, and efficiency is maintained by continuous removal of duplicate structures. Also, the combination rules satisfy conditions of chemical feasibility.

Generation of structural isomers from group vectors

The results obtained from the level-1 generation and screening algorithm are vectors of groups. Each vector can theoretically represent a number of different structural isomers. In Figure 1, the generation of isomers from the collection of group vectors is highlighted for the case of a group vector consisting of 3 CH3, 1 CH2, 1 CH, 1 CH2O. With the help of second-order groups, two isomers are highlighted.

The goal of the generation in level-2 is to:

• Increase the dimensionality of the molecular model in order to bring the results closer to the end goal of 3D structures.
• Provide a foundation for improving the quality of the predicted properties as well as allowing estimation of properties that cannot be handled when considering a molecular model consisting only of first-order functional groups (groups from level-1).

The generation is performed by combining the groups from each of the results (group vectors) from level-1 into connected graphs with groups as vertices and bonds as edges. Special care must be exerted when combining non-symmetrical groups with more than one free connection (as shown in Figure 8). The method for handling such groups is to split up the group into a sub-graph as shown in Figure 9. Because of the need to be able to handle non-symmetrical groups the generation is in fact the combination of a collection of sub-graphs (Figure 10), most of which only have one vertex, into a connected graph. When considering the generation of acyclic compounds the problem is that of generating all spanning trees in a graph with the added constraint of restrictions on the valence of each of the vertices. An example of a base-graph is shown in Figure 11.

In Figure 11, the creation of the base-graph (the graph the spanning tree is to be created in) has not been completed. This is due to the requirement that compounds should be chemically feasible and also adhere to the rules of application of first-order groups, so that multiple identical compounds are not generated from different group vectors and the "promotion" into chemical structures is "reversible". The requirement of reversible promotion can be addressed by defining rules for how groups can be combined/connected. The rules imposed cause the base-graph to be incomplete in all but the simplest cases (all groups belong to category 1 of the group classification system). An easy additional simplification can be applied for all group vectors having more than 2 groups.

Since the utilization of all groups is required, the obvious simplification is to disallow connections between groups having only 1 free connection (two such groups would form a closed molecule to which no further group could be attached). The result of the application of the feasibility rules and simplifications is illustrated by Figure 12. In the base-graph storing the allowable connections, the valence restriction of the groups is not fulfilled since it is a map of all molecules superimposed onto each other, creating a molecule superstructure representing all possible combinations (in the same way as a flowsheet superstructure represents a number of process options in process design formulations). If a molecule base-graph (or superstructure) meeting the valence requirement exactly is found, there is only one way of combining the groups into a molecule.

Once the molecule superstructure has been determined the task is to identify all the spanning trees in the superstructure with the constraint that the valence requirement of each group must be fulfilled for all the identified spanning trees (Figure 13 shows an example of such a spanning tree). The identification of all spanning trees in a graph is a complex problem even without considering the valence of the individual groups.

While the above treatment of the isomer generation as a tree building process only covers the generation of acyclic compounds it is a simple task to extend the concept to the generation of cyclic structures by relaxing the valence requirement for all groups with a valence greater than 1 in the generation of the spanning trees. The ring forming process is then performed after the tree identification by connecting vertices with free connections.

An added requirement to the problem of identifying the spanning trees, and later the rings in cyclic molecules, is the necessity of generating unique structures only and avoiding graph isomorphism (the problems related to graph isomorphism are described in Raman and Maranas (1998)).

Figure 8: Non-symmetrical group having more than 1 free connection

Figure 9: Sub-graph created by splitting a non-symmetrical group

Figure 10: The collection of sub-graphs that are to be combined into molecules

Figure 11: The base-graph in which spanning trees are to be found, not considering feasibility

Figure 12: The base-graph from Figure 11 after application of simplifications and feasibility considerations

Figure 13: Example of a valid spanning tree (a molecule)

Generation algorithm for level-2

Based on the discussion above, the methodology applied to identify the spanning trees is a recursive tree building process with repeated pruning used to remove branches leading to false solutions or duplicate structures (see Figures 14 and 15).


Figure 14: The generation tree obtained by applying the generation algorithm for level 2 (for an acyclic molecule)

Figure 15: Partial generation tree obtained by applying the generation algorithm for level 2 (for a cyclic molecule)

The algorithm is as follows:

1. Set the list of generated compounds (C) to ∅ (the empty set). Each compound in C holds a list of the free connections available (F) and information about which groups have been used to make a connection.
2. Create C0 by selecting a starting group and marking the group "used".
3. Add the free connections of the group to F0.
4. For all compounds in C:
   a. Select a compound Cj from C.
   b. For all free connections in Fj:
      i. Select a free connection U from Fj.
      ii. For all unused groups in Cj:
         A. Compare U with the connections for the unused group. If the connection is allowed, create a copy of Cj and add the copy to C as Cz.
         B. In Cz: connect the unused group in question, mark the group as used and delete U from Fz.
         C. Add the free connections of the recently used group to Fz.
   c. Delete Cj from C.
5. If all groups have been used, go to 9.
6. Compare all members of C and remove duplicates.
7. Remove all compounds having no free connections (false solutions).
8. Go to 4.
9. If cyclic compounds are to be generated, form these by creating all possible variations by connecting the remaining free connections.
10. STOP
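The tree-building idea can be sketched in Python for the simplest case. This is a simplified illustration, not the algorithm above: it assumes symmetrical groups only (no sub-graph splitting), acyclic compounds only (no ring closure in step 9), no connection rules beyond valence, and it removes duplicates once at the end through a canonical string for labelled trees instead of pruning continuously. The function names and the `VALENCE` table are assumptions made for the example.

```python
VALENCE = {"CH3": 1, "CH2": 2, "CH": 3, "C": 4}  # free connections per group


def canonical(labels, edges):
    """Root-independent code of a labelled tree: the smallest rooted encoding."""
    adj = {i: [] for i in range(len(labels))}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    def encode(v, parent):
        kids = sorted(encode(c, v) for c in adj[v] if c != parent)
        return "(" + labels[v] + "".join(kids) + ")"

    return min(encode(root, None) for root in adj)


def generate_isomers(group_vector, valence=VALENCE):
    """Enumerate distinct acyclic structures that use every group exactly once."""
    names = sorted(group_vector)
    unique = {}

    def grow(labels, edges, free, unused):
        if sum(unused.values()) == 0:
            if not free:  # all connections saturated: valid acyclic molecule
                unique.setdefault(canonical(labels, edges), (labels, edges))
            return
        if not free:      # groups remain but nothing to attach to: false solution
            return
        u, rest = free[0], free[1:]  # the first free connection must be filled
        for name in names:
            if unused[name] == 0:
                continue
            new = len(labels)
            remaining = dict(unused)
            remaining[name] -= 1
            grow(labels + [name], edges + [(u, new)],
                 rest + [new] * (valence[name] - 1), remaining)

    start = max(names, key=lambda g: valence[g])  # starting group
    unused = dict(group_vector)
    unused[start] -= 1
    grow([start], [], [0] * valence[start], unused)
    return list(unique.values())


isomers = generate_isomers({"CH3": 3, "CH2": 2, "CH": 1})
```

For the group vector 3 CH3, 2 CH2, 1 CH the sketch returns two structures, corresponding to 2-methylpentane and 3-methylpentane, each as a list of group labels and a list of bonds between their positions.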

Calculation of Properties in Level-2

Since the generation algorithm creates structures larger than the individual groups selected as building blocks it is possible to calculate properties using methods operating on structural descriptors that are assembled from the initial groups. Examples of such methods are the second-order group contribution methods of Constantinou and Gani (1994) and Marrero and Gani (2001), where the properties are predicted by summing up contributions from first-order groups as well as larger substructures (second-order groups) in the compounds with first-order groups as their building blocks. The identification of the existence of second-order groups in a structure created in level-2 can be performed by a pattern matching algorithm in which the generated adjacency matrix is examined for the presence of a smaller adjacency matrix (representing the second-order substructure). By performing this check for all second-order substructures it is possible to obtain the second-order description (a vector of the second-order groups present in the molecule) and thereby predict the properties using methods such as the Constantinou and Gani (1994) method. It is notable that since the same first-order (or level-1) description can be regarded as the "parent" of molecules having different second-order descriptions, the methods used in level-2 not only improve the quality of the property prediction but also allow for distinction between isomers.
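The pattern matching step can be sketched as a brute-force sub-graph search. This is a minimal illustration, assuming that a structure is stored as a list of group labels plus a list of bonds (rather than as an explicit adjacency matrix) and that the structures are small enough for exhaustive search; the function name `count_occurrences` is hypothetical, and the (CH3)2CH-type pattern is used purely as an example of a second-order substructure.

```python
from itertools import permutations


def count_occurrences(mol_labels, mol_edges, pat_labels, pat_edges):
    """Count distinct placements of a labelled pattern graph inside a molecule graph."""
    bonded = {frozenset(edge) for edge in mol_edges}
    hits = set()
    for mapping in permutations(range(len(mol_labels)), len(pat_labels)):
        # Every pattern vertex must map to a group of the same type ...
        if any(mol_labels[m] != p for m, p in zip(mapping, pat_labels)):
            continue
        # ... and every pattern bond must exist in the molecule.
        if all(frozenset((mapping[a], mapping[b])) in bonded for a, b in pat_edges):
            hits.add(frozenset(mapping))  # symmetric mappings count once
    return len(hits)


# Pattern: a CH bonded to two CH3 groups.
pattern = (["CH", "CH3", "CH3"], [(0, 1), (0, 2)])

# Two isomers sharing the group vector 3 CH3, 2 CH2, 1 CH.
two_methylpentane = (["CH3", "CH", "CH3", "CH2", "CH2", "CH3"],
                     [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)])
three_methylpentane = (["CH3", "CH2", "CH", "CH3", "CH2", "CH3"],
                       [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5)])

n_first = count_occurrences(*two_methylpentane, *pattern)
n_second = count_occurrences(*three_methylpentane, *pattern)
```

The two isomers have identical first-order descriptions but different counts for the pattern, which is how a second-order description separates them.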


Now that new isomers have been generated, a property estimation method employing this molecular representation may be used to estimate the properties again as well as to estimate other new properties (as highlighted in Figure 1).

6.3.4 Level-3: Creation of Atomic Based Adjacency Descriptions

In level-3 the compound descriptions obtained from level-2 are subjected to further refinement and structural variation. The goals of level-3 are to bring the compounds closer to a 3D structure and to enable the use of higher-order estimation methods that are not based on the original group-set or combinations thereof (such as second-order groups). Note that the atomic representations also define the connectivity of the molecules. Therefore, property prediction methods based on connectivity indices can be employed to predict properties that could not be predicted earlier (due to unavailable group contributions) or to verify previously estimated values.
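As an example of a connectivity-based descriptor, the first-order (Randić) connectivity index can be computed directly from the bond list of the hydrogen-suppressed structure: each bond between atoms i and j contributes 1/sqrt(d_i d_j), where d is the number of non-hydrogen neighbours. The sketch below is only an illustration of how such a descriptor is obtained once connectivity is known; the section does not state which index is used, and the function name is an assumption.

```python
from math import sqrt


def randic_index(edges):
    """First-order connectivity index of a hydrogen-suppressed molecular graph."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sum(1.0 / sqrt(degree[a] * degree[b]) for a, b in edges)


# Carbon skeletons of two C4H10 isomers.
n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (0, 2), (0, 3)]

chi_n_butane = randic_index(n_butane)    # sqrt(2) + 1/2
chi_isobutane = randic_index(isobutane)  # sqrt(3)
```

The two isomers give different index values, so a correlation built on such an index can assign them different property estimates.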

Generation Algorithm for Level-3

The level-3 generation algorithm transforms the group-based connectivity information (the adjacency matrix from level-2) into atom-based information. This is achieved by expanding each group into its corresponding atom-based adjacency matrix and replacing the groups in the group-based description with additional rows and columns to allow for group expansion. When performing the group expansion into an atomic representation, one group-based description may yield more than one atomic description. This is the case with compounds containing any of the groups listed in Table 5. The additional representations appear in the cases where the original groups have a ring element with 1 or more free connections, because of the ambiguously defined distance (in the ring) between the free bonds (as in ortho/meta/para) or between hetero-atoms and bonds in aromatic rings (as in pyridine derivatives).

Table 5: Examples of first-order groups with multiple atomic representations

First-order group    Number of isomers on an atomic basis
C5H4N                3
C5H3N                6
C4H3S                2
C4H2S                4

The algorithm for generation of atomic adjacency matrices from group-based ones consists of the following steps:

1. Set the matrix A equal to the group-based matrix from level-2.
2. List the groups in the compound.
3. For each of the groups in the compound:
   (a) Load the corresponding atom-based matrix or matrices (for groups with an ambiguous 2-dimensional representation).
   (b) Insert the atom-based matrix in the place of the corresponding group in A. If the particular group has multiple representations, create a corresponding number of copies of A.
4. Identify the atoms taking part in the original bonds between groups.
5. Reconnect the molecule by establishing connections between the atoms identified in step 4.
6. Stop
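The expansion and reconnection steps can be sketched for simple groups. This is a minimal illustration under stated assumptions: every group has exactly one atomic representation and a single atom that carries all of its free connections, so the ring groups of Table 5 (which need several copies of A) are not covered. The `TEMPLATES` dictionary and the function `expand` are hypothetical names introduced for the example.

```python
# Hypothetical atom-level templates: (atom symbols, internal bonds, attachment atom).
TEMPLATES = {
    "CH3": (["C", "H", "H", "H"], [(0, 1), (0, 2), (0, 3)], 0),
    "CH2": (["C", "H", "H"], [(0, 1), (0, 2)], 0),
    "OH": (["O", "H"], [(0, 1)], 0),
}


def expand(groups, group_bonds):
    """Turn a group-level description into atom symbols and an adjacency matrix."""
    atoms, bonds, attachment = [], [], []
    for group in groups:
        symbols, internal, attach = TEMPLATES[group]
        offset = len(atoms)                 # first atom index of this group
        atoms.extend(symbols)               # insert the atom-based block
        bonds.extend((offset + a, offset + b) for a, b in internal)
        attachment.append(offset + attach)  # atom that takes part in group bonds
    # Reconnect: every group-group bond becomes a bond between attachment atoms.
    bonds.extend((attachment[i], attachment[j]) for i, j in group_bonds)
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for a, b in bonds:
        matrix[a][b] = matrix[b][a] = 1
    return atoms, matrix


# Methanol: the groups CH3 and OH joined by one bond.
atoms, matrix = expand(["CH3", "OH"], [(0, 1)])
```

For methanol the result has six atoms; the carbon row of the matrix contains four bonds (three to hydrogen, one to oxygen) and the oxygen row two.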

After performing the conversion the net result is a series of compounds described using atoms and how they are interconnected. Furthermore all 2D structural variations on the atomic level have been generated. This conversion process is illustrated in Figure 16.

Property Prediction in Level-3

In the property prediction step of level-3 the additional structural details generated through the algorithm described above are used to further distinguish between isomers and to enable the use of higher-order methods. Whether isomer distinction and/or the use of higher-order methods is necessary depends on the CAMD problem specification and the types of molecules that are being generated. By having the designed molecules represented on an atomic level the feasible candidates become expressed in the "common language" of chemical information (i.e. the 2-dimensional representation) and it should therefore be possible to use all property estimation methods using this representation.

Having the 2D atomic structure enables the use of sources of property information other than those used in levels 1-2:

1. Directly, by calculating structural descriptors for predicting properties by correlations (such as the boiling point method using the Kier shape index as described by Horvath (1992)).

2. Indirectly, by using the detailed structural information as a starting point for the re-description of the molecule into another fragment-based description different from the original source of the candidate (created in level-1).

3. By performing structural searches in databases.

Furthermore, the structural information contained in the atomic description is available for the creation of 2D drawings (structural formulas) of the candidates.

[Figure 16, recoverable panel labels: starting group-based matrix (CH3, OH); insert CH3; insert OH; reconnect the groups.]

Figure 16: Illustration of the conversion from group to atomic representation

The re-description into alternative group-sets, thus enabling the use of other methods, serves a dual purpose. Properties already handled can be re-estimated, and additional properties can be handled using group/fragment-based methods capable of predicting properties that cannot be predicted with the original group-set (an example is the enthalpy of fusion, which can be estimated using the group-set and method of Joback & Reid (1987)). The two options are the equivalent of increasing the property trust and increasing the property range described in Chapter 1. However, there will be a computational cost associated with the re-description.

The ability to use prediction methods based on a higher order of structural descriptors (capturing more of the structural information of the compound) than used in the previous levels also increases the property trust since such methods can distinguish some forms of isomers (Horvath, 1992) and the predicted value therefore is an estimate for the particular isomer rather than the best fit to match the average of all compounds having the same description.

6.3.5 Level-4: Generation of 3D structures

In this level, generation and testing of molecules enter an interactive mode. For any selected candidate from level-3 it is possible to use molecular modeling programs such as MOPAC or Chem3D (CambridgeSoft Inc., 1997). A three-dimensional graph (or molecular model) is created by applying a set of standard bond lengths and angles for the various types of connections. The result is a molecular model that can be further analyzed in terms of conformers, stability, properties, etc.

In level-4 the final step towards a highly detailed molecular description is taken by the conversion of the selected 2D structures from level-3 to 3D molecular models. The added dimensionality of a 3D representation yields the possibility for additional structural variations. The structural isomers possible to generate and distinguish in level-4 are the ones related to the relative steric placement of bonds and atoms. The isomer types theoretically possible to distinguish and generate are (following the definitions of Morrison and Boyd (1992)):

• Z/E isomers
• R/S isomers
• cis/trans isomers
• Boat/Chair isomers
• Anti/Gauche isomers

The latter two isomer types are what is known as conformational isomers while the rest are configurational isomers. Conformational isomers can be created by rotating single bonds and are controlled by the internal energies of the compound. The configurational isomers, however, cannot be transformed into each other by rotation around single bonds. The generation algorithm of level-4 considers only the distinction between configurational isomers and leaves the conformational isomer analysis and distinction to the post-design phase of the hybrid CAMD method. The reasons for this lie in the fact that the conformational isomer behavior (or simply the conformation) of a compound is dependent on the state of the compound (temperature, pressure) as well as the presence of other compounds in the immediate environment, and that very specialized tools are required in order to analyze the conformational space. Furthermore, in a bulk phase of a compound no single conformer will be the only one present. Instead there will be a Boltzmann distribution of the conformers depending on the energy level of each possible conformer (Jonsdottir, 1995).
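The Boltzmann distribution mentioned above assigns conformer i the fraction g_i exp(-E_i/RT) divided by the sum of the same term over all conformers, where E_i is the conformer energy and g_i its degeneracy. The sketch below evaluates this expression; the energies and degeneracies in the example are hypothetical values chosen for illustration, and the function name is an assumption.

```python
from math import exp

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def boltzmann_populations(energies_kj_mol, temperature_k, degeneracies=None):
    """Fractional populations of conformers from their relative energies."""
    if degeneracies is None:
        degeneracies = [1] * len(energies_kj_mol)
    e_min = min(energies_kj_mol)  # shift energies for numerical stability
    weights = [
        g * exp(-(e - e_min) / (R_KJ_PER_MOL_K * temperature_k))
        for e, g in zip(energies_kj_mol, degeneracies)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-conformer system: a low-energy form and a doubly
# degenerate form lying 3.8 kJ/mol higher, at 298.15 K.
populations = boltzmann_populations([0.0, 3.8], 298.15, degeneracies=[1, 2])
```

Even with a modest energy gap, the higher-energy conformer keeps a substantial share of the population at room temperature, which is why a single 3D structure cannot represent the bulk phase.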

Generation Algorithm for Level-4

The basis for the generation in level-4 is the addition of

• Hybridization information (i.e. the bond configuration and standard angles between the bonds)

• Placement in an x,y,z coordinate space

for each of the atoms in each adjacency matrix description obtained from level-3. For a single compound representation the level-4 promotion algorithm is:

1. Select an atom participating in 1 bond and add the atom to Y (the collection of used atoms).
2. Assign the x,y,z position of the origin (0,0,0) to the atom.
3. Set the bond direction (D) to (0,0,1) for the free connection.
4. Add the free connection to the tail of the list of free connections F.
5. Select the free connection U from the head of F.
6. Find the atom M participating in the connection U and not part of Y.
7. Determine the hybridization of the atom based on the atom type and the number and types of bonds it participates in.
8. Determine the (x,y,z) position $P_M$ of the atom by calculating

   $P_M = a \cdot D_U + P_U$    (1)

   where $a$ is the bond length (the bond length can be fixed or dependent on the bond type), $D_U$ is the bond direction of the connection U and $P_U$ is the position of the other atom participating in bond U.
9. Remove U from F.
10. Add the free bonds of M to F. Each free bond obtains its bond direction information by rotating the base configuration for the atom (the hybridization) in such a way that the previously made connection is superimposed on $D_U$.
11. Add M to Y.
12. If Y does not contain all the atoms, go back to 5.
13. If F is not empty (only possible for cyclic structures), create the connection pairs for the remaining free connections based on the original connections in the level-3 description.
14. For each atom:
    (a) Analyze for the existence of chiral centers (R/S isomers).
    (b) If found, duplicate the entire structure and swap the positions of any two substituents.
15. For each double bond:
    (a) Analyze for the possibility of Z/E isomerism.
    (b) If found, duplicate the compound and swap the positions of the substituents on either of the atoms participating in the double bond.
16. For each single bond in a ring between atoms participating in 4 single bonds:
    (a) Analyze for the possibility of cis/trans isomerism.
    (b) If found, duplicate the compound and swap the positions of the substituents on either of the atoms participating in the bond.
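The placement rule of step 8 can be illustrated for a single sp3 centre. The sketch below reads equation (1) as position = bond length times unit bond direction plus the position of the atom already placed, and applies it to the four hydrogen atoms of methane using the ideal tetrahedral directions. It does not implement the rotation of the base configuration in step 10; the bond length of 1.09 Å is an assumed typical C-H value, and the function names are hypothetical.

```python
from math import acos, degrees, sqrt


def place_atom(p_u, d_u, bond_length):
    """Equation (1): new position = bond_length * unit direction + known position."""
    return tuple(p + bond_length * d for p, d in zip(p_u, d_u))


def angle_deg(centre, p, q):
    """Angle p-centre-q in degrees."""
    v = [a - c for a, c in zip(p, centre)]
    w = [a - c for a, c in zip(q, centre)]
    dot = sum(a * b for a, b in zip(v, w))
    norm = sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w))
    return degrees(acos(dot / norm))


S = 1.0 / sqrt(3.0)
# Base configuration of an sp3 atom: four unit vectors toward tetrahedron corners.
TETRAHEDRAL = [(S, S, S), (S, -S, -S), (-S, S, -S), (-S, -S, S)]

carbon = (0.0, 0.0, 0.0)
C_H_LENGTH = 1.09  # assumed typical C-H bond length in angstrom
hydrogens = [place_atom(carbon, d, C_H_LENGTH) for d in TETRAHEDRAL]
h_c_h = angle_deg(carbon, hydrogens[0], hydrogens[1])
```

Every hydrogen ends up at the assumed bond length from the carbon, and any H-C-H angle equals the tetrahedral angle of about 109.47 degrees.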

Figure 17 illustrates the conversion of the atomic description to a 3D model. The analysis for the presence of R/S, Z/E and cis/trans isomers is done using the extended ACMC method (see appendix B at the end of this chapter) by calculating and comparing the codes for the atoms participating in each analyzed substituent.
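The comparison of substituent codes can be illustrated with a much simpler stand-in for the extended ACMC method, which is not reproduced in this section. The sketch below flags an atom as a candidate R/S centre when it has four single-bonded neighbours whose branches all have different codes, using a sorted rooted-tree string as the code. It assumes an explicit-hydrogen, acyclic, singly bonded structure (the recursion does not terminate on rings), and all names are hypothetical.

```python
def branch_code(adj, symbols, atom, parent):
    """Canonical string of the branch that starts at `atom`, seen from `parent`."""
    kids = sorted(
        branch_code(adj, symbols, nb, atom) for nb in adj[atom] if nb != parent
    )
    return "(" + symbols[atom] + "".join(kids) + ")"


def stereocentre_candidates(symbols, bonds):
    """Atoms with four neighbours whose branches are pairwise different."""
    adj = {i: [] for i in range(len(symbols))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    found = []
    for atom, neighbours in adj.items():
        if len(neighbours) == 4:
            codes = {branch_code(adj, symbols, nb, atom) for nb in neighbours}
            if len(codes) == 4:
                found.append(atom)
    return found


# 2-butanol, CH3-CH(OH)-CH2-CH3: atoms 0-3 are carbon, 4 is oxygen, 5-14 hydrogen.
butanol_symbols = ["C", "C", "C", "C", "O"] + ["H"] * 10
butanol_bonds = [(0, 1), (1, 2), (2, 3), (1, 4),
                 (0, 5), (0, 6), (0, 7), (1, 8), (2, 9), (2, 10),
                 (3, 11), (3, 12), (3, 13), (4, 14)]

# 2-propanol, CH3-CH(OH)-CH3: atoms 0-2 are carbon, 3 is oxygen, 4-11 hydrogen.
propanol_symbols = ["C", "C", "C", "O"] + ["H"] * 8
propanol_bonds = [(0, 1), (1, 2), (1, 3),
                  (0, 4), (0, 5), (0, 6), (1, 7),
                  (2, 8), (2, 9), (2, 10), (3, 11)]

butanol_centres = stereocentre_candidates(butanol_symbols, butanol_bonds)
propanol_centres = stereocentre_candidates(propanol_symbols, propanol_bonds)
```

In 2-butanol the carbon carrying the OH group has four different substituents and is flagged, while in 2-propanol the two methyl branches give identical codes and no centre is found.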

[Figure 17, recoverable labels: starting point (group structure with CH3, CH and CH2 groups); convert; atom-based adjacency matrix; convert; Tripos / native Alchemy input file; Chem3D module invoked.]

Figure 17: Illustration of the process of creation of 3D models

Capabilities and Limitations of the Level-4 Generation Algorithm

It should be noted that the algorithm for generation of 3D molecular structures does not consider 2 important aspects:

1. Torsion angles in the final structures are random. This is due to the fact that the algorithm only examines and considers 1 bond at a time. If torsional information should be included in the placement calculations the algorithm would have to examine 2 additional bonds for each atom placement.

2. For compounds containing cycles there is no guarantee that the generated 3D models contain cycles with uniform bond length. Since the ring-building process is completed by forming bonds without consideration for the length it is possible to obtain models containing very deformed rings.

Both of the above limitations may be handled through external molecular modeling software capable of property prediction and/or generating descriptor information. Note that the 3D structure of rings and the torsional state of a molecule need to be investigated as part of the examination of conformational isomers. Furthermore, their "correct" values depend heavily on the methods used to calculate the properties of the compounds; for example, energy minimization performed with the MM2 method (CambridgeSoft Inc., 1997) may lead to different results than those obtained using the AM/3 force field of MOPAC (CambridgeSoft Inc., 1997).

6.4 POST-DESIGN PHASE

In the post-design phase the results from the design-phase solution engine are analyzed with respect to properties and behavior that could not be part of the design considerations. Examples of such properties and behavior are price, availability, legislative restrictions, process wide performance and many more. At the end of this analysis the final selection of the product identity must be made.

6.4.1 Analysis of design solutions

The analysis involves using other sources not considered in level-1 to level-4. Examples of such sources could be:

• Property estimation and molecular modeling tools for validation of predicted properties not handled by the CAMD algorithm and/or validation of the properties estimated during the design phase.

• Databases for examination of environmental or legislative requirements as well as reaction pathways.

• Supplier catalogues for price and availability information.

• Engineering insight and simulation tools such as mixture analysis and phase behavior calculations.

Which tools and data sources to use depends on the original CAMD problem specification.

Databases, process synthesis/design tools, process modeling & simulation tools, analysis tools, etc., are all useful in this phase. Also, analysis based on experiments and/or experimental data should be considered. Finally, web-based database search, if possible, could also be carried out. This is particularly useful for verification of EH&S properties.

6.4.2 Final Candidate Selection

After validation of the obtained results the final candidates must be selected. This selection must take all the available information into account including socio-economic aspects and the out of process (or indeed life cycle) performance of the different compounds.

Integration of Process-Product Design

This type of selection procedure is beyond the scope of this book, but the presented CAMD framework has been used successfully in process design algorithms addressing the process-wide environmental performance with respect to energy consumption and emissions control (Hostrup et al., 1999). In the approach of Hostrup et al. (1999) the presence or absence of each result from CAMD in the process is controlled by integer variables in a super-structure formulation of the process design problem, and the compounds are subsequently selected using an MINLP solution algorithm. The method shows that the design/selection of compounds for a particular purpose can be performed as a subproblem of a larger process design problem. The benefit of doing this rather than including the compound design in the overall problem formulation is that external sources of data can be used to validate the estimations and computationally more complex models can be used for property estimation. The benefits are achieved without sacrificing any versatility, since solving the CAMD problem as a subproblem with the proposed method identifies all compounds possessing the properties essential to the desired functionality and makes it possible to screen out less desirable candidates by adjusting the property constraints related to performance and environment. The developed framework is therefore well suited for use with the advanced methods for impact minimization developed by other researchers (such as the MEIM method by Pistikopoulos et al. (1994) and the WAR algorithm by Cabezas et al. (1999)).

In any real application of CAMD the final testing involves experimental determination of key properties and behavior regardless of what method is used to select the final candidates. The power and purpose of CAMD is to limit the number of candidates to those showing the maximum potential and not to replace experimental testing.

6.5 IMPLEMENTATION OF THE FRAMEWORK

The proposed framework has been partially implemented as a computer program "ProCAMD" (ICAS Documentations, 2002). The screening on the basis of the atomic representation (level 3) is done using external tools, specifically the property prediction program "ProPred" (ICAS Documentations, 2002) and the commercial drawing and property estimation program "ChemDraw Ultra 2000" (CambridgeSoft Inc., 1999). The "ProPred" package includes an implementation of the extended ACMC method (see Appendix). The treatment of 3D structures (the results from level 4) has been performed in the commercial molecular modelling program "Chem3D Pro" (CambridgeSoft Inc., 1997).

Figure 18: Link between ProCAMD and Chem3D

Figure 19 shows the modular structure of the implementation along with the dependencies of the modules and the external programs and data sources it is connected to. It has been a goal of the development to enforce a structure where each of the major parts of the algorithm was represented by separate modules of code thereby making it easy to update and modify the code, as well as having the opportunity to create custom solutions in the future by extracting selected modules and inserting them into another framework.

6.5.1 Extension of the Hybrid CAMD Method to Complex Molecules

The hybrid CAMD method has been extended (Nielsen, 2000) to include a new database of large complex molecules containing their pure component data and solubility data in known solvents. The search for solvents starts with defining the solute structure, determining the pure component properties (if not available in the database), generating the group representation, evaluating the sensitivity of the solubility calculations to the property model parameters, and generating diagrams of solubility versus solvent solubility parameter. The maximum of the solubility in these plots approximately identifies the solubility parameter of the complex solute molecule and, therefore, the target properties of the desired solvent. The algorithm is shown in the form of a block diagram in Figure 20. It has been applied to the solution of the CAMD problems discussed in chapter 8.


Figure 19: Structure of ProCAMD highlighting methods & tools employed.


Several examples of the application of the methodology are given elsewhere in this book. See for example chapters 8 and 9 where applications of ProCAMD are highlighted for solvent design problems. Besides solvent design applications (in chapters 8 & 9), the following simple molecular design problems are suggested for the reader as tutorial exercises.

• Find all organic molecules with C, H & O atoms having normal boiling points between 300 K and 400 K that form azeotropes with ethanol at 1 atm pressure.

• Find non-aromatic organic molecules that, when added to a mixture of acetic acid-chloroform in the liquid phase, cause a phase split; assume a temperature of 300 K and a pressure of 1 atm.

• Find all cyclic organic molecules with C, H & O atoms that have the same normal boiling point (equal or lower), Hildebrand solubility parameter, melting point (equal or higher) as benzene but not its EH&S properties.

• Find how many chemically feasible molecules can be formed with the groups CH3, CH2, CH, OH, CHO, CH3CO, CH2CO considering a minimum of 2 groups and a maximum of 5 groups.

• Find all compounds that match the following property constraints:

  475 K < normal boiling point < 525 K
  325 K < normal melting point < 375 K
  -250 kJ/mol < Heat of fusion at 298 K < -220 kJ/mol
  -0.75 < Log Octanol-water partition coefficient < -0.50
  4.0 < Log water solubility (log mg/L) < 5.5

Figure 20: The extended hybrid CAMD method

Solutions to the above problems can be obtained from R. Gani ([email protected]).

6.7 CONCLUSIONS

The hybrid CAMD method can be regarded as a general purpose methodology that provides the framework for future developments needed to solve current and future problems in the area of product and formulation design. The framework is flexible enough to provide the link between molecular structure representation and property estimation at different scales of size. It also provides a link with the databases and knowledge-based systems needed for the pre-design and post-design phases. Although most of the examples employing this methodology, in this chapter and elsewhere in the book, deal with selection and design of solvents, it can be and has been employed for fluid design, search for azeotropes, search for polymer repeat unit structures, search for additives and many more. The vast collection of property models integrated into the ProCAMD software makes the application range quite large. Current and future work is extending the methodology towards design of larger molecules and isomers typically found in design of drugs, pesticides, speciality chemicals and polymers.


1. H. Cabezas, J. Bare and S. Mallick, "Pollution prevention with chemical process simulators: The generalized waste reduction (WAR) algorithm", Computers and Chemical Engineering, 23 (1999) 623-634.

2. CambridgeSoft Inc., Chem3D Pro Users Guide, CamSoft Inc., Cambridge, MA, USA, 1997

3. CambridgeSoft Inc., ChemDraw Ultra 2000 Manual, CamSoft Inc., Cambridge, MA, USA, 1999.

4. L. Constantinou and R. Gani, "New Group Contribution Method for the Estimation of Properties of Pure Compounds", AIChE J., 40 (1994) 1697-1710.

5. R. Gani, B. Nielsen, A. Fredenslund, "A Group Contribution Approach to Computer Aided Molecular Design", AIChE J., 37 (1991) 1318-1332.


6. P. M. Harper, "A Multi-Phase, Multi-Level Framework for Computer Aided Molecular Design", PhD-thesis, Technical University of Denmark, Lyngby, Denmark (2000).

7. L. Horvath, "Molecular Design. Chemical structure Generation from the Properties of Pure Organic Compounds", Studies in Physical and Theoretical Chemistry Book Series, Volume 75, Elsevier, Amsterdam, The Netherlands (1992).

8. M. Hostrup, P. M. Harper, R. Gani, "Design of Environmentally Benign Processes: Integration of Solvent Design and Process Synthesis", Computers and Chemical Engineering, 23 (1999) 1395-1414.

9. ICAS Documentations, Internal Report PEC02-14, CAPEC, Department of Chemical Engineering, DTU, Lyngby, Denmark (2002).

10. K. G. Joback, R. C. Reid, "Estimation of Pure Component Properties from Group Contributions", Chemical Engineering Communications, 57 (1987) 233-243.

11. K. G. Joback, G. Stephanopoulos, "Searching Spaces of Discrete Solutions: The Design of Molecules Possessing Desired Physical Properties", Advances in Chemical Engineering, 21 (1995) 257-311.

12. S. Jonsdottir, "Theoretical Determination of UNIQUAC Interaction Parameters", PhD-thesis, Technical University of Denmark, Lyngby, Denmark (1995).

13. J. Marrero, R. Gani, "Group-contribution based estimation of pure component properties", Fluid Phase Equilibria, 183-184 (2001) 183.

14. R. T. Morrison, R. N. Boyd, "Organic Chemistry", 6th Edition, Prentice-Hall Inc., New Jersey, USA (1992).

15. M. B. Nielsen, "Solubility Prediction of Complex Compounds with UNIFAC", MSc-thesis, Technical University of Denmark, Lyngby, Denmark (2000).

16. E. N. Pistikopoulos, S. K. Stefanis, A. G. Livingston, "A Methodology for Minimum Environmental Impact Analysis", In AIChE Symposium Series on Pollution Prevention Through Process and Product Modification, AIChE, New York, USA (1994).

17. V. S. Raman, C. D. Maranas, "Optimization in Product Design with Properties Correlated with Topological Indices", Computers and Chemical Engineering, 22 (1998) 747-763.

18. Y. Xiao, Y. Qiao, J. Zhang, S. Lin, W. Zhang, "A Method for Substructure Search by Atom-Centered Multilayer Code", Journal of Chemical Information and Computer Sciences, 37 (1997) 701-704.


APPENDIX

A: Equations Used in Generation Algorithms

If an unrestricted exhaustive enumeration of all possible combinations of groups is performed, the number of alternatives is given by (Joback and Stephanopoulos, 1995),

$$M = \sum_{K=K_{min}}^{K_{max}} \frac{(N + K - 1)!}{K!\,(N - 1)!} \qquad (A.1)$$

Eq. A.1 is derived from

$$M = \sum_{K=K_{min}}^{K_{max}} M_{N,K} \qquad (A.2)$$

where,

$$M_{N,K} = \binom{N + K - 1}{K} \qquad (A.3)$$

In the above equations, K is the number of elements in a population of size N, while M is the number of combinations.
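As a quick numerical check, the sketch below counts unrestricted group combinations, assuming Eq. A.1 is the standard count of combinations with repetition summed over the allowed molecule sizes. The function name `count_unrestricted` and the test case (7 distinct groups, 2 to 5 groups per molecule) are illustrative assumptions and are not taken from the cited work.

```python
from math import comb


def count_unrestricted(n_groups: int, k_min: int, k_max: int) -> int:
    """Sum C(N + K - 1, K) over K = k_min..k_max (combinations with repetition)."""
    return sum(comb(n_groups + k - 1, k) for k in range(k_min, k_max + 1))


# Illustrative case: 7 distinct groups, molecules built from 2 to 5 groups.
# Term by term: 28 + 84 + 210 + 462.
print(count_unrestricted(7, 2, 5))
```

The count grows quickly with both N and K, which is why the valence and category constraints that follow are applied before any structures are generated.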

The valence constraints are expressed as,

$$K = n_S + n_D + n_T + n_Q \qquad (A.4)$$

$$n_S = n_T + 2 n_Q - 2(A - 1) \qquad (A.5)$$

In Eqs. A.4 and A.5, $n_S$, $n_D$, $n_T$ and $n_Q$ are the number of groups in a molecule with 1 free attachment, 2 free attachments, 3 free attachments and 4 free attachments, respectively, while A is the maximum number of rings in a molecule.
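The effect of the valence constraint can be illustrated with a brute-force sketch. It assumes Eq. A.5 reads n_S = n_T + 2 n_Q - 2(A - 1) and applies it, for acyclic molecules (A = 0), to every multiset of 2 to 5 groups drawn from a small group set. The table `FREE_ATTACHMENTS` and the function names are hypothetical, and only the valence rule is checked; the category constraints that control chemical feasibility are not included.

```python
from itertools import combinations_with_replacement

# Assumed number of free attachments per group (illustrative table).
FREE_ATTACHMENTS = {
    "CH3": 1, "OH": 1, "CHO": 1, "CH3CO": 1,
    "CH2": 2, "CH2CO": 2,
    "CH": 3,
}


def satisfies_valence(groups, rings=0):
    """Check n_S = n_T + 2*n_Q - 2*(rings - 1) for a multiset of groups."""
    n = {1: 0, 2: 0, 3: 0, 4: 0}
    for g in groups:
        n[FREE_ATTACHMENTS[g]] += 1
    return n[1] == n[3] + 2 * n[4] - 2 * (rings - 1)


def valence_feasible(k_min, k_max, rings=0):
    """Enumerate all group multisets of size k_min..k_max passing the valence rule."""
    names = sorted(FREE_ATTACHMENTS)
    return [
        combo
        for k in range(k_min, k_max + 1)
        for combo in combinations_with_replacement(names, k)
        if satisfies_valence(combo, rings)
    ]


feasible = valence_feasible(2, 5)
print(len(feasible))  # number of multisets passing the valence screen
```

For this group set the screen keeps 160 of the 784 unrestricted combinations; each surviving multiset may still correspond to several isomers or be rejected by the category rules.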

The classification of compounds in terms of classes and categories (see Gani et al. 1991) is used to control the chemical feasibility of the generated molecules. In addition to the constraints A.4 and A.5, the following conditions are considered.

Constraints related to category-2 groups

$$C^{(2)} \le n_{S,2} + n_{D,2} + n_{T,2} + n_{Q,2} \qquad (A.6)$$

Constraints related to category-3 groups

$$C^{(3)} \le n_{S,3} + n_{D,3} + n_{T,3} + n_{Q,3} \qquad (A.7)$$

Constraints related to category-4 groups

$$C^{(4)} \le n_{S,4} + n_{D,4} + n_{T,4} + n_{Q,4} \qquad (A.8)$$

Constraints related to category-5 groups

$$C^{(5)} \le n_{S,5} + n_{D,5} + n_{T,5} + n_{Q,5} \qquad (A.9)$$

Constraints related to categories 4+5 (combined) groups

$$C^{(4+5)} \le n_{S,4} + n_{D,4} + n_{T,4} + n_{Q,4} + n_{S,5} + n_{D,5} + n_{T,5} + n_{Q,5} \qquad (A.10)$$

Constraints related to categories 3+4+5 (combined) groups

$$C^{(3+4+5)} \le n_{S,3} + n_{D,3} + n_{T,3} + n_{Q,3} + n_{S,4} + n_{D,4} + n_{T,4} + n_{Q,4} + n_{S,5} + n_{D,5} + n_{T,5} + n_{Q,5} \qquad (A.11)$$

The number of combinations for the j'th solution to the category constraints for the i'th solution to the valency constraint is given by

(A.12)

For a solved illustrative example (Harper, 2000) and the corresponding group classification tables, contact R. Gani ([email protected]).

B: Molecular Encoding Technique

The applied method of group and fragment identification is based on the generation and identification of molecular codes (molecular "fingerprints"). By applying an encoding method to both molecules and fragments a set of numbers is obtained for each molecule and fragment. Presence or absence of a particular fragment in a molecule is determined by examining the code sets of the molecule and the fragment in question. If the fragment's codes are a subset of the molecule's, the fragment is present in the molecule.

The encoding of the molecular and fragment fingerprints is done using an expanded and adapted version of the Atom-Centered Multi-Layer Code approach of Xiao et al. (1997). The method has been improved to allow for additional flexibility in the definition of fragments and better handling of bond type (cyclic/acyclic) considerations.
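The subset test described above can be sketched in a few lines. The integer codes used here are invented placeholders, not real Atom-Centered Multi-Layer Codes, and `fragment_present` is a hypothetical helper name; the sketch only shows the set comparison, not the encoding itself.

```python
def fragment_present(molecule_codes: set, fragment_codes: set) -> bool:
    """A fragment is taken to be present when all of its codes occur in the molecule."""
    return fragment_codes <= molecule_codes


# Placeholder code sets for illustration only.
molecule = {101, 2050, 3310, 4120}
fragment_a = {101, 2050}   # every code also appears in the molecule
fragment_b = {101, 9999}   # one code is missing from the molecule

print(fragment_present(molecule, fragment_a))  # True
print(fragment_present(molecule, fragment_b))  # False
```

Because the check is a plain set comparison, its cost depends only on the number of codes and not on the size of the molecular graph.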

[Figure 21 panels: a molecule and a fragment, each with a table of codes listed by level]

Figure 21: Examples of codes for a molecule and a fragment

Examples of the results of the molecular encoding technique are illustrated through Figure 21. It can be seen that the code-set for the fragment is a subset of that of the depicted molecule and it can therefore be established that the fragment is present in the molecule.


Chapter 7: Identification of Multistep Reaction Stoichiometries: CAMD Problem Formulation

A. Buxton, A. Hugo, A.G. Livingston & E.N. Pistikopoulos

Reaction path synthesis and the selection of an optimal route for the manufacturing of a desired product provide the earliest opportunities for waste reduction when designing environmentally sound processes. In the work presented here, a systematic procedure for the rapid identification of alternative multi-step stoichiometries is described in which minimum environmental impact considerations are incorporated. Both the size and complexity of the reaction path synthesis problem are reduced by decomposing it into a series of steps. First, a new group based co-material enumeration algorithm introduces material design principles through structural and chemical feasibility constraints to rapidly generate a manageable set of raw materials and co-products. Next, stoichiometries are extracted from the co-material set using a two step optimisation procedure, including whole number stoichiometric coefficient constraints, carbon structure constraints and case specific constraints based on chemical knowledge. Thermodynamic, economic and environmental impact criteria are employed in the evaluation of feasible stoichiometries, with aspects of the Methodology for Environmental Impact Minimisation (Pistikopoulos et al., 1994) providing the framework for the environmental evaluation of alternatives.

7.1 INTRODUCTION

In the synthesis of a facility to manufacture a given desired product, the selection of an appropriate chemistry provides the earliest opportunity to influence the environmental and economic performance of the process. However, reaction route design and selection is a large and difficult problem. The major difficulty is attaching enough information to a particular reaction route alternative to make an informed choice about the potential of the route to be developed into a promising process. When only chemistry is known, it is difficult to quantify the costs and wastes associated with the eventual process in which the chemistry will be carried out, because of the large number of sources of expenses and waste which are not directly related to chemistry, and the range of different process topologies and equipment which may be associated with alternative chemistries. This problem is compounded by the fact that the vast majority of reaction schemes of industrial interest are of a multi-step nature (Rotstein et al., 1982).

Recognising these problems, it seems more sensible to identify candidate multi-step reaction routes rapidly according to some simple criteria using limited information, than to devote time and resources to developing detailed reaction schemes which may be rejected later due to poor process performance.

However, the synthesis of alternative reaction paths leading to a desired product was initiated by organic chemists who were interested in synthesising large, complex molecules, albeit in more efficient ways. Consequently, their approaches tended to concentrate on the generation of chemistries, rather than on the selection of promising routes. Agnihotri and Motard (1980) categorised these tools as information based systems and logic based systems, according to the reaction representation technique employed.

Information-based systems have their roots in real chemistry. Molecules are represented in terms of their atomic or group constituents and reactions, known as transforms, are based on real, known chemical transformations (Corey et al., 1969, 1972, 1976; Gelernter et al., 1973; Wipke et al., 1976; Govind and Powers, 1981; Kaufmann, 1977; Knight, 1995; Mavrovouniotis and Bonvin, 1995). The development of appropriate transforms relies heavily on chemical knowledge while each transform may carry with it information relating to the molecular substructures to which it can be applied and the structural alterations it brings about - requiring details of any by-products which are produced or reagents which are required as well as typical operating conditions for the reaction and kinetic information.

Consequently, according to Govind and Powers (1981), information based systems offer good predictive power, in that they are able to represent specific distinct reactions in detail. However, they suffer poor generality, since their ability to represent different reactions is limited to the available transforms. Furthermore, a large database of information or a set of predictive techniques is required to implement such approaches.

Information based systems most commonly build synthesis trees, usually working backwards (retrosynthesis) from the product. Retrosynthesis is an open ended problem which may lead to the development of a large network of reaction schemes and corresponding materials, even with a small number of transforms. Accordingly, the screening requirements which go along with information based systems are typically large.

By comparison, logic based systems are much easier to handle and control. These methods employ purely mathematical representations for molecules and their reactions (Ugi and Gillespie, 1971; Hendrickson, 1976). The most widely studied logic-based approach is centred around an atom balance, a matrix equation which describes the chemistry of a particular set of predetermined species, and from which stoichiometries leading to a particular product can be extracted (Rotstein et al., 1982; Fornari et al. 1989, 1994a, 1994b; Crabtree and El-Halwagi, 1994; Holiastos and Manousiouthakis, 1998). In this approach, only chemical formula information is required to generate stoichiometries, so that this approach provides a much more direct route to alternative multi-step reaction schemes.
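A minimal sketch of the atom balance idea follows. It assumes a small, pre-selected species set and searches small integer coefficients by brute force for stoichiometries that produce one mole of a target product with every element balanced. The species set, the coefficient range and the function names are illustrative assumptions; the cited works extract stoichiometries with optimisation techniques rather than exhaustive search.

```python
from itertools import product as cartesian

# Elemental composition of each candidate species (formula information only).
SPECIES = {
    "CH3OH":   {"C": 1, "H": 4, "O": 1},
    "CO":      {"C": 1, "O": 1},
    "H2O":     {"H": 2, "O": 1},
    "CH3COOH": {"C": 2, "H": 4, "O": 2},
}
ELEMENTS = ("C", "H", "O")


def atom_balance(nu):
    """Net atoms of each element for coefficients nu (negative = reactant)."""
    return tuple(
        sum(nu[s] * SPECIES[s].get(e, 0) for s in SPECIES) for e in ELEMENTS
    )


def find_stoichiometries(target, max_coeff=2):
    """Integer stoichiometries making 1 mol of target with all elements balanced."""
    others = [s for s in SPECIES if s != target]
    found = []
    for coeffs in cartesian(range(-max_coeff, max_coeff + 1), repeat=len(others)):
        nu = dict(zip(others, coeffs))
        nu[target] = 1
        if all(r == 0 for r in atom_balance(nu)):
            found.append(nu)
    return found


for nu in find_stoichiometries("CH3COOH"):
    print(nu)
```

With this species set the only balanced result is CH3OH + CO -> CH3COOH, with water taking a zero coefficient. Enlarging the species set enlarges the search space rapidly, which is why pre-selection of the species matters.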

However, in order to apply this approach, all candidate raw materials and stoichiometric co-products (which will henceforth be referred to collectively as co-materials) must be known in advance and included in the matrix. While the careful pre-selection of these materials provides an early opportunity to limit the size of the problem and to screen out poor materials, no systematic method has been proposed to generate these materials.

While Fornari et al. (1989, 1994a, 1994b) and Crabtree and El-Halwagi (1994) limited themselves to single step reactions, Rotstein et al. (1982) and Holiastos and Manousiouthakis (1998) demonstrated the potential of the approach to develop multi-step reactions by considering closed cycle sequences of reactions known as clusters. A cluster of reactions is a sequence of thermodynamically feasible reactions in which the intermediates produced by the reactions in the cluster must also be consumed by other reactions in the cluster, with the net result being an overall main reaction which is thermodynamically infeasible, and therefore not directly achievable (Rotstein et al., 1982). In cluster synthesis, this main reaction must be specified in advance. Rotstein et al. (1982) also applied their approach to open cycle sequences of reactions, in which the intermediates produced within the sequence are not completely consumed. However, although they introduced unspecified raw materials and co-products, they limited themselves to overall reactions in which the desired product and certain of the raw materials were specified in advance.
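The closed cycle property of a cluster, in which intermediates cancel so that only the main reaction remains, can be checked by summing the step stoichiometries. The sketch below uses abstract, hypothetical species names (A, B, X, Y) and a hypothetical helper `net_stoichiometry`; it does not test thermodynamic feasibility of the individual steps.

```python
def net_stoichiometry(steps):
    """Sum step coefficients per species and drop species that cancel out."""
    net = {}
    for step in steps:
        for species, coeff in step.items():
            net[species] = net.get(species, 0) + coeff
    return {s: c for s, c in net.items() if c != 0}


# Hypothetical two-step cluster (negative coefficients are reactants).
steps = [
    {"A": -1, "X": -1, "Y": 1},   # A + X -> Y
    {"Y": -1, "B": 1, "X": 1},    # Y -> B + X
]

print(net_stoichiometry(steps))  # {'A': -1, 'B': 1}
```

Here X and Y are produced and consumed within the cycle, so the net result is the main reaction A -> B. In an open cycle sequence some intermediate would survive in the returned dictionary.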

Without careful consideration, stoichiometries generated by the matrix based approach may involve any number of apparently simultaneous reactants and co-products (so that a single stoichiometry may in fact be decomposable into several sequential steps) with stoichiometric coefficients that may take any values. Buxton et al. (1997) were the first to tackle these problems directly, introducing linear whole number stoichiometry constraints together with limitations on the numbers of reactant and product species. Recently, Holiastos and Manousiouthakis (1998) introduced non-linear integer constraints to perform the same functions in the context of reaction cluster synthesis. They defined allowable chemical reactions according to the general characteristics of elementary reactions, which depict chemical transformations as they truly happen at the atomic scale, and applied their constraints accordingly. Using a modified branch and bound solution procedure they circumvented the non-linearity of their integer constraints. Extensions have predominantly concentrated on the application of integer programming techniques to the design of simplified reaction mechanisms for improved computational efficiency (Androulakis, 2000; Edwards et al., 2000; Sirdeshpande et al., 2001).

The key advantage of information based systems is that they can provide kinetic information for the preliminary screening of reaction routes. Knight (1995) employed computational chemistry involving statistical mechanics and probability theory to determine products, their distribution and the reaction rates, while Mavrovouniotis and Bonvin (1995) used chemometrics (the simulation of reaction systems with kinetic models and principal factor analysis) to identify the major pathways. Consequently, the information and computational requirements of these approaches are large.

Although the predictive power of the matrix based approach is poor, since it provides much less information, much simpler criteria can be applied to identify promising candidate stoichiometries, or at least to eliminate poor alternatives. Simple economic criteria, based only on the values of products and reactants, have been employed by Fornari and Stephanopoulos (1994b). Gibbs free energy of reaction has been used to provide an initial indication of the cost feasibility of a process (conversion, yield, recycle flows, difficulty of separation etc.) (Fornari and Stephanopoulos, 1994b), to indicate the directionality and reversibility of reaction steps (Mavrovouniotis and Bonvin, 1995), to determine equilibrium concentrations among reacting species (Crabtree and El-Halwagi, 1994) and to provide an upper limit for thermodynamic feasibility (Agnihotri and Motard, 1980; Fornari et al., 1989, 1994a, b; May and Rudd, 1976; Rotstein et al., 1982). A Gibbs free energy change of reaction of 10 kcal/gmol has long been accepted as an upper bound for the thermodynamic feasibility of reactions (Rotstein et al., 1982). Rotstein et al. (1982) used this criterion to determine the temperature range over which reactions are thermodynamically feasible.
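The 10 kcal/gmol screen can be written as a one-line test once the Gibbs free energy change of a step is available. The sketch below assumes the change is computed as the stoichiometry-weighted sum of Gibbs energies of formation at a fixed temperature. The formation values are placeholders chosen for illustration, not measured data, and the function names are hypothetical.

```python
G_LIMIT_KCAL = 10.0  # upper bound on the Gibbs free energy change, kcal/gmol


def delta_g_reaction(nu, g_formation):
    """Stoichiometry-weighted sum of formation energies (reactants negative)."""
    return sum(coeff * g_formation[s] for s, coeff in nu.items())


def is_thermo_feasible(nu, g_formation, limit=G_LIMIT_KCAL):
    """Apply the feasibility screen: delta G of reaction must not exceed the limit."""
    return delta_g_reaction(nu, g_formation) <= limit


# Placeholder formation energies in kcal/gmol (illustrative only).
g_formation = {"A": -30.0, "B": -10.0, "C": -50.0}

print(is_thermo_feasible({"A": -1, "B": 1}, g_formation))  # A -> B: +20, rejected
print(is_thermo_feasible({"A": -1, "C": 1}, g_formation))  # A -> C: -20, accepted
```

Since formation energies vary with temperature, repeating the test over a temperature grid would give the feasible temperature range of a step.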

The only documented reaction route design technique to take explicit account of environmental issues is that of Crabtree and El-Halwagi (1994). In order to select an innocuous stoichiometry, they imposed simple concentration limits on certain compounds in the reactor effluent stream. However, this approach does not provide a consistent method of assessing the environmental impact of alternative reaction routes since only the effluent concentrations of certain compounds were considered (not the impacts of all compounds). Furthermore, it is unlikely that the reactor effluent, or even the by-products or co-products, would be emitted directly to the environment. Moreover, the input wastes associated with the raw materials and the impacts of downstream processing are not included.


In the work presented here, a procedure for the rapid identification of alternative multi-step stoichiometries is developed. Material design principles are introduced to formalise the development of a set of co-materials and an optimisation procedure, based around the matrix representation, is employed to extract stoichiometries from this set. Linear constraints are developed to limit the number of reactant and product species and to ensure that each stoichiometric step involves whole number stoichiometric coefficients. Thermodynamic, economic and environmental impact criteria are employed in the evaluation of the stoichiometries, while aspects of the Methodology for Environmental Impact Minimisation (MEIM) of Pistikopoulos et al. (1994) provide the framework for the environmental evaluation of alternatives. The application of the method is illustrated in this chapter through an example: the synthesis of production routes for the pesticide 1-naphthalenyl-N-methyl carbamate, also known as carbaryl. In Chapter 14, the main features of the methodology are further highlighted through a second case study: the production of acetic acid, an important aliphatic intermediate.

7.2 IDENTIFICATION OF ENVIRONMENTALLY BENIGN STOICHIOMETRIES

The problem addressed here may be stated as follows:

Given a desired organic product

Identify a set of candidate multi-step organic reaction stoichiometries for the production of the desired product which are both economically and environmentally promising.

A three step procedure is applied, involving: (i) selection of co-material groups, (ii) determination of a set of candidate co-materials using group based molecular design techniques and (iii) identification of a set of promising candidate multi-step stoichiometries using the matrix based representation system and an optimisation procedure incorporating aspects of the MEIM.

The use of such a structured, stepwise procedure reduces the multi-step stoichiometry identification problem to a manageable size. The key to the procedure is the introduction of co-material design (steps (i) and (ii)). With the product and stoichiometric co-materials known, the identification of feasible reaction stoichiometries is no longer an open ended problem. The steps of the procedure are described in the following sections.


7.3 CO-MATERIAL DESIGN

7.3.1 Introduction

Co-material design is based on the observation that much organic chemistry essentially consists of reorganising functional groups, through additions, substitutions and eliminations, so that co-materials are expected to contain (at least) the chemical groups present in the desired product. According to this observation, a group based computer aided design approach is adopted for co-material design. The aim of this approach is to systematically enumerate a set of alternative stoichiometric co-material candidates from a group set selected according to the groups present in the product, those present in any existing industrial co-materials, the types of chemistries to be considered (e.g. aromatic or aliphatic) and other considerations such as property constraints.

Groups are employed as the molecular building blocks rather than atoms for several reasons. First of all, this considerably reduces the combinatorial size of the molecular generation problem without much loss of generality, since very many organic compounds can be constructed using only a small number of groups. Secondly, a suitable choice of groups (e.g. UNIFAC groups) gives direct access to physico-chemical, thermodynamic and environmental properties through group contribution methods. Finally, with appropriate group bonding restrictions, such a method provides a short cut to structurally and chemically feasible molecules, hence significantly reducing molecular screening requirements.
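To show how a group description gives direct access to a property, the sketch below estimates a normal boiling point as a constant plus a sum of group contributions, in the style of the Joback and Reid (1987) method. The contribution values and the constant are assumed from the commonly tabulated form of that method and should be verified against the original tables before use. These are Joback groups, not the UNIFAC groups used in this chapter, and the names `TB_CONTRIB`, `TB_BASE` and `estimate_tb` are hypothetical.

```python
# Assumed boiling point contributions in K (to be verified against the source tables).
TB_CONTRIB = {"CH3": 23.58, "CH2": 22.88, "OH": 92.88}
TB_BASE = 198.2  # assumed additive constant, K


def estimate_tb(groups):
    """Normal boiling point estimate: constant plus sum of count * contribution."""
    return TB_BASE + sum(n * TB_CONTRIB[g] for g, n in groups.items())


# Ethanol described as one CH3, one CH2 and one OH group.
ethanol = {"CH3": 1, "CH2": 1, "OH": 1}
print(f"{estimate_tb(ethanol):.2f} K")
```

Because the estimate depends only on group counts, it can be evaluated for every enumerated candidate before any structure is drawn, which is what makes group based enumeration and property screening fit together.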

Any of the molecular design techniques reviewed by Buxton (2002) may be applied to generate sets of candidate materials. However, of the variety of available techniques, only the enumeration and knowledge based approaches are specifically designed to explicitly enumerate molecules from a pre-selected set of groups. All other approaches can be viewed as implicit enumeration strategies, in which the aim is to identify optimal structures through evolution or optimisation without explicitly constructing all alternatives. Thus, the knowledge based and enumeration approaches represent the best candidates for use in co-material design.

Of these, the most general approach is that of Gani and coworkers, as reported by Constantinou et al. (1996). This procedure is UNIFAC group based, and includes in the enumeration algorithm rules designed to ensure that only structurally and chemically feasible molecules result from the molecular design exercise. These two features make this approach the most attractive starting point for co-material design. Although structural and chemical feasibility rules feature in the other group based techniques, Derringer and Markham (1985) focussed only on polymers, Joback, Stephanopoulos and coworkers (1984, 1989, 1995) employed a generate and test paradigm, applying their rules only after generating all possible combinations of groups, and Porter et al. (1991) considered only certain homologous series.

The computer aided product design (CAPD) approach reported by Constantinou et al. (1994, 1996) is based on a system of group classification and categorisation. A total of one hundred and eight unique UNIFAC groups are featured in the technique, including nine aromatic groups. These groups are divided into nine classes and five categories. The class of the group (0 - 4) represents the number of free attachments of the group (i.e. the group valency) and the category signifies the level of restriction for bonding with other groups: the higher the category, the tighter the restrictions. The aromatic groups are placed in classes 5 - 8; class zero consists of some simple complete molecules.

The molecular design algorithm is based on a set of primary and secondary conditions. The primary conditions ensure structural and chemical feasibility, firstly by guaranteeing that the complete compound has zero valency and secondly that it obeys the principles of chemistry. These principles have been embodied in a set of rules which determine the maximum permissible number of groups from any category which can be present in a molecule and the permissible combinations of groups from the different categories. The secondary conditions are related to restrictions arising from the limited validity of the group contribution property prediction methods.

The rules based on the primary conditions are divided into three sets, one each for acyclic, cyclic and aromatic molecules, and the UNIFAC groups have been divided into three sets (which share many common groups) according to the desired molecular structure. From these group sets, the rules allow for the design of cyclic and acyclic molecules of up to twelve groups and of aromatic molecules of up to eighteen groups with a maximum of three aromatic rings. The molecular design algorithm has been developed to systematically generate all molecules which satisfy these conditions. However, some feasible structures are rejected because of doubtful stability or because group parameters are not available.

Nevertheless, despite this conservatism, this technique can potentially generate thousands of molecules (Constantinou et al., 1996), which is more than adequate for co-material design. Furthermore, the approach provides very well for the inclusion of the additional structural restrictions which may be necessary in co-material design. Full details of group classification, categorisation and division, and of the primary chemical feasibility rules are provided in Constantinou et al. (1996), and a description of the enumeration algorithm is presented in Gani et al. (1991). It is these rules which form the basis of the co-material design procedure presented in the following sections. In addition to these rules, the co-material design procedure features other rules based on engineering and chemical insight, which are designed to reduce the size of the enumeration problem.

7.3.2 Co-Material Design Procedure

GROUP PRE-SELECTION

Group pre-selection is the first step towards designing co-material molecules and has the most direct effect on the number of molecules generated. To restrict the size of the enumeration problem, the following simple rules are employed to guide group pre-selection: (i) select the groups present in the product, (ii) select the groups present in any existing industrial raw materials, co-products or by-products, (iii) add groups which provide the basic building blocks for the functionalities of the product or of similar functionalities, (iv) add groups from the group sets for the desired chemistry (cyclic, acyclic or aromatic) and (v) reject groups which violate property restrictions (e.g. chloro groups may violate environmental restrictions; Gani et al., 1991).

CO-MATERIAL ENUMERATION FORMULATION

The co-material enumeration formulation consists of four sets of equations: chemical feasibility rule equations (based on the rules provided by Constantinou et al., 1996), the octet rule for structural feasibility, additional problem specific structural restrictions and the objective function. It is assumed that this approach provides all interesting organic co-materials and that all generated molecules are chemically feasible. The existence of generated molecules may be verified from chemistry literature (Compounds, 1996), although such sources tend to include rare compounds which may be unlikely co-materials. The sets employed in the algorithm are shown in Table 1.

Table 1: Co-Material Enumeration Model Sets

J     chemical groups
CL    group class
CT    group category
R     chemical feasibility rules

Chemical Feasibility Rules

In Constantinou et al. (1996), the chemical feasibility rules are given according to the categories of groups. Category one groups have no bonding restrictions. According to Gani et al. (1991), category two groups of classes 1 - 4 are special groups which can appear more than once but cannot be connected with each other or with another group from the same or higher category. Since there are only six category two groups in classes 1 - 4, only one of which (the chloro group) is included in the example problems considered here, no general rules reflecting these restrictions were included in the co-material enumeration formulation. To avoid violation of these restrictions, integer constraints are instead included on a case by case basis.

For categories 3 - 5, the chemical feasibility rules are presented in tabular form in Constantinou et al. (1996), with a separate table for acyclic, cyclic and aromatic molecules. In each table the columns are: the total number of groups in a molecule, the largest class of group present, the number of groups from this largest class, the maximum allowable number of groups from category 3, the maximum allowable number of groups from category 4, the maximum allowable number of groups from category 5, the total number of groups allowed from categories 3, 4 and 5 together, and the total number of groups allowed from categories 4 and 5 together.

Thus, each row in the tables represents a unique set of rules for the allowable numbers and combinations of groups from categories 3, 4 and 5 according to the total number of groups, the largest class of group present in the molecule and the number of groups from this largest class. Above a certain total number of groups, it is possible to construct molecules with the same total number of groups in which the largest class of group is different, and in which the number of groups from this largest class is different. Thus, there can be several rows in the table, and therefore several rule sets, for a particular total number of groups.

In order to enumerate co-materials, each table column is first written as an R × 1 vector, where R is the number of rule sets (i.e. the number of rows in the rule table). However, the treatment of classes is somewhat different from that in the tables. Instead of writing a largest class vector and a number of groups from this largest class vector, two vectors are written for each class: one which gives a lower bound, and a second which gives an upper bound, on the allowable number of groups from that class. For classes above the maximum for the particular rule set, both lower and upper bounds are set to zero. For the largest class, both lower and upper bounds are given the appropriate value for the rule set, and for classes below the maximum, the lower bound is set to zero and the upper bound is given the value of the total number of groups minus the number of groups from the largest class. The rules can then be written as the following equations.
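Before the equations are stated, the construction of the class bound vectors can be illustrated with a short Python sketch. The function name `class_bounds` and the example row (five groups in total, largest class 3, one group of that class) are hypothetical and are not taken from the rule tables of Constantinou et al. (1996); only the non-aromatic classes 1 - 4 are considered.

```python
def class_bounds(total, largest_class, n_largest, classes=(1, 2, 3, 4)):
    """Return {class: (lower, upper)} for one row of a rule table.

    total          total number of groups in the rule set
    largest_class  largest class of group present
    n_largest      number of groups from this largest class
    """
    bounds = {}
    for cl in classes:
        if cl > largest_class:
            # classes above the largest class may not appear at all
            bounds[cl] = (0, 0)
        elif cl == largest_class:
            # the largest class appears exactly n_largest times
            bounds[cl] = (n_largest, n_largest)
        else:
            # lower classes share whatever groups remain
            bounds[cl] = (0, total - n_largest)
    return bounds


# Hypothetical rule-table row, for illustration only.
row = class_bounds(total=5, largest_class=3, n_largest=1)
```

Each entry of `row` corresponds to one element of the lower and upper bound vectors for the rule set in question.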

First of all, an R × 1 vector of binary variables $d_r$ is introduced such that:

$$\sum_r d_r = 1 \qquad (1)$$

This vector is used throughout the equations to ensure that only one rule set r is active at any one time. The total number of groups in a molecule is then given by:

$$\sum_j \sum_{cl} \sum_{ct} n_{j,cl,ct} = \sum_r d_r\, n_r^{t} \qquad (2)$$


where $n_{j,cl,ct}$ is defined as a non-negative integer variable which represents the number of groups j which appear in a molecule, and $n_r^{t}$ is the total number of groups in the rule set r. cl and ct are the class and category of group j respectively; each group is given a unique class and category assignment by the following equation:

$$\sum_{cl} \sum_{ct} n_{j,cl,ct} = n_{j,cl',ct'}, \quad \forall j \in J \qquad (3)$$

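This equation allows $n_{j,cl,ct}$ to be non-zero only for cl = cl' and ct = ct', where cl' and ct' are the class and category assigned to group j; for all other combinations of cl and ct, $n_{j,cl,ct}$ must be zero. The allowable number of groups from each class is given by:

$$\sum_j \sum_{ct} n_{j,cl,ct} \ge \sum_r d_r\, n_r^{cl,min}, \quad \forall cl \in CL \qquad (4)$$

$$\sum_j \sum_{ct} n_{j,cl,ct} \le \sum_r d_r\, n_r^{cl,max}, \quad \forall cl \in CL \qquad (5)$$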

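where $n_r^{cl,min}$ and $n_r^{cl,max}$ are the minimum and maximum numbers of groups allowed from class cl in rule set r. The numbers of groups from categories 3, 4 and 5 are limited by:

$$\sum_j \sum_{cl} n_{j,cl,3} \le \sum_r d_r\, n_r^{ct3} \qquad (6)$$

$$\sum_j \sum_{cl} n_{j,cl,4} \le \sum_r d_r\, n_r^{ct4} \qquad (7)$$

$$\sum_j \sum_{cl} n_{j,cl,5} \le \sum_r d_r\, n_r^{ct5} \qquad (8)$$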

where $n_r^{ct3}$, $n_r^{ct4}$ and $n_r^{ct5}$ are the maximum group numbers allowed from categories 3, 4 and 5 respectively in rule set r. The numbers of groups from categories 3, 4 and 5 summed together, and from categories 4 and 5 summed together, are similarly limited:

$$\sum_j \sum_{cl} \left( n_{j,cl,3} + n_{j,cl,4} + n_{j,cl,5} \right) \le \sum_r d_r\, n_r^{ct345} \qquad (9)$$

$$\sum_j \sum_{cl} \left( n_{j,cl,4} + n_{j,cl,5} \right) \le \sum_r d_r\, n_r^{ct45} \qquad (10)$$

where $n_r^{ct345}$ and $n_r^{ct45}$ are the maximum total group numbers allowed from categories 3, 4 and 5 summed together, and from categories 4 and 5 summed together, respectively.

Octet Rule

In order to ensure that complete molecules have zero valency, the octet rule is introduced:

$$\sum_j \sum_{cl} \sum_{ct} (2 - v_j)\, n_{j,cl,ct} = 2m \qquad (11)$$

where $v_j$ is the valency of group j (equal to class for classes 0 - 4) and m is 1, 0, -1 or -2 for acyclic, monocyclic, bicyclic and tricyclic compounds respectively.
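The octet rule is easily checked by hand for small molecules, and the following Python sketch does the same for a few familiar non-aromatic group sets. The group valencies listed are the usual ones for these groups; the dictionary and function names are illustrative assumptions and are not part of the original formulation.

```python
# Valency (number of free attachments) of a few simple non-aromatic groups.
VALENCY = {"CH3": 1, "CH2": 2, "CH": 3, "C": 4, "OH": 1}

# Value of m in equation (11) for each type of structure.
RING_PARAMETER = {"acyclic": 1, "monocyclic": 0, "bicyclic": -1, "tricyclic": -2}


def satisfies_octet_rule(counts, structure):
    """Check sum_j (2 - v_j) * n_j == 2m for a dict of group counts."""
    lhs = sum((2 - VALENCY[group]) * n for group, n in counts.items())
    return lhs == 2 * RING_PARAMETER[structure]


propane = {"CH3": 2, "CH2": 1}
isobutane = {"CH3": 3, "CH": 1}
cyclohexane = {"CH2": 6}
decalin = {"CH2": 8, "CH": 2}
dangling = {"CH3": 3}          # three end groups cannot form a closed molecule

checks = [
    satisfies_octet_rule(propane, "acyclic"),
    satisfies_octet_rule(isobutane, "acyclic"),
    satisfies_octet_rule(cyclohexane, "monocyclic"),
    satisfies_octet_rule(decalin, "bicyclic"),
    satisfies_octet_rule(dangling, "acyclic"),
]
```

The first four group sets satisfy equation (11), while the last leaves one attachment unsatisfied and is rejected.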

Additional Structural Restrictions

In addition to the above rules, other restrictions may be introduced on a case by case basis to limit the numbers of co-materials designed. To prevent chemistries in which the co-materials are much simpler or much more complicated than the product, the maximum and minimum number of groups in each co-material can be bounded:

$$\sum_j \sum_{cl} \sum_{ct} n_{j,cl,ct} \ge n_{min} \qquad (12)$$

$$\sum_j \sum_{cl} \sum_{ct} n_{j,cl,ct} \le n_{max} \qquad (13)$$

where $n_{min}$ and $n_{max}$ are the minimum and maximum allowable numbers of groups. These constraints indirectly restrict chain length in homologous series. More direct constraints can be written by bounding the sums of the numbers of group types in any series.

Since the formation and cleavage of carbon-carbon bonds often requires extreme operating conditions which are likely to disrupt the chemistry of interest, it may be desirable to avoid co-materials which must undergo changes in carbon skeletal structure in order to arrive at the product. In general this is difficult to achieve, since co-material design focuses on types and numbers of groups, rather than on the connections between them. However, many undesirable materials can be avoided by imposing restrictions on the allowable types and numbers of groups. The numbers of branches, substituents, substituted sites and functional groups may also be limited in this way to avoid co-materials which are significantly more or less structurally complicated than the product. For example, if only monosubstituted benzenes are required, the following equations are introduced:

$$\sum_{cl} \sum_{ct} n_{ACH,cl,ct} = 5$$

$$\sum_{cl} \sum_{ct} n_{AC,cl,ct} = 1$$

and m is set to zero in the octet rule (equation 11) to allow only monocyclic structures. Additional restrictions can be incorporated in the stoichiometry identification exercise to avoid, or at least further reduce, the generation of chemistries which alter carbon skeletal structures, if required.


Objective Function

The objective is set as the minimisation of the total number of groups in a molecule:

$$\text{Minimise} \; \sum_j \sum_{cl} \sum_{ct} n_{j,cl,ct} \qquad (14)$$

In this way, co-materials are enumerated subject to the above rules, starting with the simplest first.

Solution Procedure

The above formulation consists entirely of binary and integer variables in linear equations and is therefore a mixed integer linear programming (MILP) problem. In order to generate a set of co-materials, the problem is solved repeatedly with an integer cut written after each iteration to exclude the current optimal group combination from future iterations. However, it is the precise combination of numbers of groups which must be eliminated, not just the combination of group types (excluding group type combinations would eliminate homologous series). In order to do this the binary variable $CUT_{j,t}$ is introduced, which is related to $n_{j,cl,ct}$ as follows:

$$\sum_t t \cdot CUT_{j,t} = \sum_{cl} \sum_{ct} n_{j,cl,ct}, \quad \forall j \in J \qquad (15)$$

$$\sum_t CUT_{j,t} = 1, \quad \forall j \in J \qquad (16)$$

According to these equations, $CUT_{j,t}$ is non-zero only for t = t', where t' is the number of times group j occurs in a molecule. $CUT_{j,t}$ is zero for all other values t ≠ t'. The integer cuts are written in terms of $CUT_{j,t}$.
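The effect of cutting on group numbers rather than group types can be seen in the following Python sketch. It is not the MILP itself: exhaustive search over a tiny, hypothetical group set stands in for the solver, the bound of one OH group is an arbitrary stand-in for the category rules, and all names are illustrative assumptions.

```python
from itertools import product

VALENCY = {"CH3": 1, "CH2": 2, "OH": 1}    # small illustrative group set
GROUPS = list(VALENCY)


def feasible(counts, m=1, n_min=2, n_max=4):
    """Octet rule (11), size bounds (12)-(13) and a hypothetical OH limit."""
    octet = sum((2 - VALENCY[g]) * n for g, n in zip(GROUPS, counts))
    return (n_min <= sum(counts) <= n_max
            and octet == 2 * m
            and counts[GROUPS.index("OH")] <= 1)


def cut_matrix(counts, t_max):
    """CUT[j][t] = 1 only where t is the number of occurrences of group j."""
    return [[1 if t == n else 0 for t in range(t_max + 1)] for n in counts]


def violates_cut(counts, previous, t_max):
    """True if counts repeats every group number of a previous solution."""
    cut = cut_matrix(counts, t_max)
    hits = sum(cut[j][t] for j, t in enumerate(previous))
    return hits == len(previous)    # the integer cut demands hits <= |J| - 1


def enumerate_co_materials(t_max=4):
    found = []
    while True:
        candidates = [c for c in product(range(t_max + 1), repeat=len(GROUPS))
                      if feasible(c)
                      and not any(violates_cut(c, p, t_max) for p in found)]
        if not candidates:
            return found
        # objective (14): the candidate with the fewest groups comes first
        found.append(min(candidates, key=sum))


co_materials = enumerate_co_materials()
```

Because each cut removes only one exact vector of group numbers, both (2, 1, 0) and (2, 2, 0), which use the same group types, survive as successive members of a homologous series.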

Note that linear group contribution property prediction equations and bounds may be included in the above formulation without affecting the solution procedure. For example, to exclude co-materials with high toxicity, the following equation could be introduced based upon the lethal concentration (mol/l) causing 50% mortality in fathead minnow (LC50):

where the toxicity contribution of group j is taken from Gao et al. (1992), and $LC50_{min}$ is the lowest permitted LC50. Since $LC50_{min}$ is fixed, this equation is linear.

ADDITIONAL MOLECULES

To complete any stoichiometry, it may be necessary to include some simple additional molecules, which cannot be systematically designed using the above procedure. A set of simple complete molecules appears as class zero in Constantinou et al. (1996). However, further molecules may be required on a case by case basis according to any existing industrial stoichiometries and the type of chemistries to be considered. Examples of such molecules include oxygen, hydrogen, hydrogen chloride or other hydrogen halides, chlorine or other halogen molecules, carbon monoxide and carbon dioxide. A subset of these, or a larger set, may be selected as required as the final step of co-material design.

7.4 STOICHIOMETRY IDENTIFICATION FORMULATION

The multistep reaction stoichiometry identification problem can be defined as follows. Given (i) a desired product and desired production rate, (ii) a set of stoichiometric co-materials, (iii) cost information for each material and group contribution parameters for the corresponding group set, (iv) a set of role specification and chemistry constraints and (v) a range of reactor operating conditions, the objective is to determine a set of candidate multi-step reaction stoichiometries which are promising in terms of both economics and environmental impact.

The model for the identification and economic and environmental evaluation of a single step reaction stoichiometry is presented below, followed by a description of the solution algorithm in which this model is used to develop multistep stoichiometries. The model consists of seven sets of equations: an atom balance, whole number stoichiometry constraints, role specification constraints, chemistry constraints, carbon structure constraints, pure component property prediction equations and a reactor process model.

The sets employed in the model are shown in Table 2.

Table 2: Stoichiometry Identification Model Sets

E           elements
S           species
CS (⊂ S)    carbon containing species
J           chemical groups

The formulation is based on the assumption that chemical species undergo reactions either singly (e.g. thermal decomposition or isomerisation, ignoring any reagent, catalyst or solvent effects) or at most in pairs, so that the number of reactants is limited to at most two. An upper limit is applied on the total number of materials in each stoichiometry (since the number of reactants is limited, this effectively limits the number of co-products) and no competing reactions are considered (stoichiometry determination can only develop stoichiometric co-products, not side products).

The following additional assumptions are made in the analysis: isobaric reactor operation at known pressure Ptot, gas phase reaction and perfect gas behaviour. Only the products and the reactants are costed, no process equipment or operating costs are considered and the inherent inaccuracies in the property prediction techniques and thermodynamic models employed are accepted.

Clearly, incorporating side reactions will add to the impacts so that the present results are lower bounds in this respect. The limits and cuts employed here are practical constraints which can be tightened or relaxed as desired. In principle, the thermodynamic model permits consideration of operation at any pressure. More detailed costing depends on more sophisticated process models.

7.4.1 Atom Balance

The starting point for this work is an atom balance equation which describes the chemistry of a particular set of S species composed of E elements (Rotstein et al., 1982). The atom balance is written as follows:

$$\alpha\, v = 0 \qquad (18)$$

where $\alpha$ is the E × S atomic matrix and $v$ is the S × 1 column vector of stoichiometric coefficients $v_s$.

It is assumed that the rank of the matrix $\alpha$ is E. In general, S = E + m, so that m represents the degrees of freedom (DOFs) in the system. These DOFs represent stoichiometric coefficients which must be specified in order for the atom balance to be solved. The remaining S - m coefficients are then determined as functions of these. Clearly when m = 0, a unique solution exists, and when m ≥ 1, there is an infinity of solutions, corresponding to an infinity of possible stoichiometries.
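As a concrete illustration of the atom balance and its degrees of freedom, the following Python sketch uses the combustion of methane, a system chosen here purely as a familiar example and not taken from the chapter. The helper `rank` is a plain Gaussian elimination over exact fractions, written out so that the sketch needs only the standard library.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a small matrix by Gauss-Jordan elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Atomic matrix alpha: rows are elements (C, H, O),
# columns are species (CH4, O2, CO2, H2O).
ALPHA = [
    [1, 0, 1, 0],   # C
    [4, 0, 0, 2],   # H
    [0, 2, 2, 1],   # O
]

# Stoichiometric coefficients: negative for reactants, positive for products.
v = [-1, -2, 1, 2]   # CH4 + 2 O2 -> CO2 + 2 H2O

balance = [sum(a * x for a, x in zip(row, v)) for row in ALPHA]
degrees_of_freedom = len(v) - rank(ALPHA)   # m = S - E when rank(alpha) = E
```

Here the matrix has full rank E = 3 and S = 4, so a single coefficient (for instance that of the product) fixes the whole stoichiometry.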

7.4.2 Whole Number Stoichiometries Constraints

At the atomic level, chemical species react in whole number ratios, so that in general meaningful chemical reactions are written in terms of stoichiometric coefficients which are rational numbers (i.e. whole numbers or numbers which can be expressed as ratios of whole numbers). Through multiplication by appropriate factors, stoichiometries involving only whole number coefficients can therefore be obtained. In such stoichiometries the product coefficient is a whole number which may be greater than or equal to unity.

In their atom balances, Rotstein et al. (1982), and later Crabtree and El-Halwagi (1994), assigned the value unity to the product stoichiometric coefficient with no restrictions on the co-material coefficients. While this does not lead to any loss of generality, it potentially allows the development of an infinity of meaningless solutions in which the co-material coefficients are not rational numbers.

In order to ensure that only solutions involving whole number stoichiometric coefficients are obtained, the following linear equations are introduced, where $v_p$ is the stoichiometric coefficient of the desired product:

$$v_p \ge 1 \qquad (19)$$

$$x_s = \sum_{n=1}^{N} 2^{(n-1)}\, b_{ns}, \quad \forall s \in S \qquad (20)$$

Assigning $v_p \ge 1$ allows the necessary flexibility in the value of the product stoichiometric coefficient so that there is no loss of generality. $x_s$ is a dummy coefficient which is defined as a non-negative continuous variable. For each species s, this variable is expressed as a linear combination of binary (i.e. 0-1) variables $b_{ns}$. In this way, the continuous coefficients $x_s$ are constrained to take whole number values in the range from zero to an upper limit determined by the value of N, namely $2^N - 1$.

The real stoichiometric coefficients $v_s$ are related to the dummy coefficients $x_s$ as follows:

$$v_s = x_s - 2 x_s\, ii_s, \quad \forall s \in S$$

The binary variable $ii_s$ is necessary since the coefficients $v_s$ may take positive or negative values. The variables $ii_s$ take the value zero if species s is a product ($v_s$ positive) and unity if species s is a reactant ($v_s$ negative), so that $ii_s$ is the reactant flag. This equation may be linearised using the Glover (1975) transformation, yielding:

$$v_s = x_s - 2 y_s, \quad \forall s \in S \qquad (21)$$

$$y_s - v_{max} \cdot ii_s \le 0, \quad \forall s \in S \qquad (22)$$

$$x_s + v_{max}(ii_s - 1) - y_s \le 0, \quad \forall s \in S \qquad (23)$$

$$y_s - x_s \le 0, \quad \forall s \in S \qquad (24)$$

where $y_s$ is a dummy variable for the product $x_s\, ii_s$ and $v_{max}$ is the maximum permitted magnitude for any stoichiometric coefficient. The variables $y_s$ are defined as non-negative continuous variables. To ensure that they take non-zero values whenever species s is flagged as a reactant, the following additional constraint is applied:

$$y_s \ge ii_s, \quad \forall s \in S \qquad (25)$$


Note that for any particular stoichiometry, $x_s$ and $v_s$ are non-zero only for the species involved and zero for all other species, while $y_s$ is non-zero only for the reactants involved and zero for all other species (including products and co-products).
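That the linear constraints reproduce the bilinear product of the dummy coefficient and the reactant flag can be confirmed by exhaustive enumeration. The Python sketch below does this on an integer grid with a hypothetical bound of 3 on the coefficient magnitude; in the model itself y is continuous, so the grid is a simplification made for illustration, and the function names are assumptions.

```python
from itertools import product

V_MAX = 3    # hypothetical maximum magnitude of any stoichiometric coefficient
N_BITS = 2   # N in equation (20); 2**N_BITS - 1 equals V_MAX here


def x_from_bits(bits):
    """Equation (20): x = sum over n of 2**(n-1) * b_n, with n starting at 1."""
    return sum(2 ** n * b for n, b in enumerate(bits))


def linearised_ok(x, ii, y):
    """Constraints (22)-(25) for one species."""
    return (y - V_MAX * ii <= 0                   # (22)
            and x + V_MAX * (ii - 1) - y <= 0     # (23)
            and y - x <= 0                        # (24)
            and y >= ii)                          # (25)


x_values = sorted(x_from_bits(b) for b in product((0, 1), repeat=N_BITS))

feasible_points = [(x, ii, y)
                   for x, ii, y in product(x_values, (0, 1), range(V_MAX + 1))
                   if linearised_ok(x, ii, y)]

# Equation (21): recover the signed stoichiometric coefficient.
coefficients = sorted({x - 2 * y for x, ii, y in feasible_points})
```

Every feasible point has y equal to the product of x and the reactant flag, and the recovered coefficients cover each whole number from -3 to 3.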

7.4.3 Role Specification Constraints

Role specification constraints (Fornari et al., 1994a, 1989) are used to restrict the participation of molecules in the stoichiometries; for example, to avoid certain stoichiometric co-products or to define a species as a raw material only.

In order to apply such constraints the raw materials and products in any stoichiometry must be identified. Raw material identification is taken care of by the binary reactant flag $ii_s$ from the whole number stoichiometry constraints. Products are identified using the following equations:

$$x_s - y_s - v_{max} \cdot i_s \le 0, \quad \forall s \in S \qquad (26)$$

$$x_s - y_s - (v_{max} + 1) \cdot i_s + v_{max} \ge 0, \quad \forall s \in S \qquad (27)$$

where the $i_s$ are the binary elements of a vector I. Together with equations 20 and 21-24, these relationships assign the value zero or unity to $i_s$ when the stoichiometric coefficient $v_s$ is negative or positive, respectively. Thus, $i_s$ is the product flag.

In order to relate the raw material and product flags, a third flag $iii_s$ is introduced which takes the value zero if species s is a raw material or a product, and unity if species s is not involved in the stoichiometry. The three flags are related as follows:

$$i_s + ii_s + iii_s = 1, \quad \forall s \in S \qquad (28)$$

The role specification constraints are then posed simply by specifying the values of the flags in advance. For example, to define species s as a raw material only, it is excluded from being a co-product by setting $i_s = 0$. The role specification constraints may be written differently for different stoichiometry steps. The full list of the constraints used in the example presented later in this chapter is presented in Appendix A.
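The way in which the three flags partition the species can be traced with a small Python sketch. It takes the product flag constraints in the form x - y - v_max·i ≤ 0 and x - y - (v_max + 1)·i + v_max ≥ 0, uses a hypothetical bound of 3 on the coefficient magnitude, and the function names are illustrative assumptions.

```python
V_MAX = 3   # hypothetical maximum magnitude of any stoichiometric coefficient


def product_flag_values(x, y):
    """Binary values of the product flag i permitted by (26) and (27)."""
    return [i for i in (0, 1)
            if x - y - V_MAX * i <= 0                   # (26)
            and x - y - (V_MAX + 1) * i + V_MAX >= 0]   # (27)


def flags(v):
    """Return (i, ii, iii) for a whole number stoichiometric coefficient v."""
    x = abs(v)                  # dummy coefficient
    ii = 1 if v < 0 else 0      # reactant flag
    y = x * ii                  # value forced by the linearisation (22)-(24)
    (i,) = product_flag_values(x, y)   # fails unless exactly one value is allowed
    iii = 1 - i - ii            # equation (28): flag for uninvolved species
    return i, ii, iii


roles = {v: flags(v) for v in range(-V_MAX, V_MAX + 1)}
```

For every coefficient exactly one flag is set: the reactant flag for negative values, the product flag for positive values and the third flag when the species does not take part.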

7.4.4 Chemistry Constraints

In addition to the role specifications, the binary flags are employed to develop knowledge based chemistry constraints. These are used to restrict the number of reactants and products involved in any stoichiometry, and to eliminate certain chemistries.

According to Holiastos and Manousiouthakis (1998) an elementary reaction can involve up to three reacting molecules and, if the reaction is to be reversible, up to three product molecules. Furthermore, since the formation or cleavage of chemical bonds which occurs during an elementary reaction requires the orbitals of reacting molecules to come sufficiently close together and be correctly oriented, elementary reactions involving two reacting molecules are more likely than those involving three purely on statistical grounds.

According to the same ideas, the number of different reacting species in any stoichiometry is here limited by the following:

$$\sum_s ii_s \le N_r^{max} \qquad (29)$$

where $N_r^{max}$ is a problem specific maximum number of reactants. In the example presented in this chapter, $N_r^{max}$ is assigned the value two, which effectively eliminates side reactions except in the unlikely event of a simultaneous isomerisation.

A problem specific upper limit on the total number of species $N_{sp}^{max}$ involved in any stoichiometry is also imposed according to the number of different species involved in the most complex step of the existing industrial routes to the product of interest.

$$\sum_s (i_s + ii_s) \le N_{sp}^{max} \qquad (30)$$

Since there must be at least one reactant, this constraint limits the number of products to at most $N_{sp}^{max} - 1$.

Note that these constraints limit only the numbers of different species involved in any stoichiometry, not their stoichiometric coefficients. However, all stoichiometric coefficients are constrained to be less than $v_{max}$, and can be further constrained by introducing the following equation if required:

$$x_s \le v_s^{max}, \quad \forall s \in S \qquad (31)$$

where $v_s^{max}$ is the maximum permitted stoichiometric coefficient of species s.

Other knowledge based chemistry constraints may be imposed directly on certain species. The following examples are provided for illustration; the full list of the chemistry constraints employed in the illustrative example problem is presented in Appendix A.

• species a and b must not react together

$$ii_a + ii_b \le 1$$

• species a may only react with species b or species c

$$ii_a - (ii_b + ii_c) \le 0 \qquad (32)$$

• species c may only be produced by reacting species a and b

$$2 i_c - (ii_a + ii_b) \le 0 \qquad (33)$$
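Since each of these constraints involves only binary flags, its logical meaning can be read off from a truth table. The following Python sketch evaluates the three example constraints directly; the function names are illustrative assumptions.

```python
from itertools import product


def must_not_react_together(ii_a, ii_b):
    """Species a and b may not both be reactants."""
    return ii_a + ii_b <= 1


def a_only_with_b_or_c(ii_a, ii_b, ii_c):
    """Constraint (32): if a reacts, b or c must react as well."""
    return ii_a - (ii_b + ii_c) <= 0


def c_only_from_a_and_b(i_c, ii_a, ii_b):
    """Constraint (33): c may be a product only if both a and b react."""
    return 2 * i_c - (ii_a + ii_b) <= 0


# Flag combinations (i_c, ii_a, ii_b) that constraint (33) rules out.
excluded_by_33 = [flags for flags in product((0, 1), repeat=3)
                  if not c_only_from_a_and_b(*flags)]
```

Constraint (33) excludes exactly the three combinations in which c is produced while a, b or both are absent from the reactants.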

In order to identify a set of candidate stoichiometries, the stoichiometry selection problem must be solved iteratively, with integer cuts to exclude previous solutions. The binary flags provide the mechanism for this. Since the same combination of reactants may be involved in several stoichiometries in which the product is derived from the same underlying reaction but with redistribution of the co-products, the integer cuts are written to exclude only the combinations of raw materials observed in the solutions. In this way, such redundant solutions are avoided.

7.4.5 Carbon Structure Constraints

According to section 3.2, constraints may be needed to prevent chemistries in which carbon-carbon bonds are broken or formed. However, since the atom balance contains no structural information it is not possible to write such constraints directly. Furthermore, since the need to break or form carbon-carbon bonds depends on the set of co-materials and the nature of the chemistry to be considered, carbon structure constraints can only be developed on a case by case basis. Moreover, the development of general constraints is hampered by the fact that only the final product is known in advance (it is not known which co-materials will be reactants or co-products).

Despite these difficulties, general constraints which infer certain restrictions on carbon structural changes are possible, and under certain special circumstances, carbon structure changes can be eliminated.

Noting that $y_s$ is non-zero only for reactants and $x_s - y_s$ is non-zero only for products, the following constraint may be used to prevent a net change in the number of carbon-carbon bonds in a stoichiometry:

$$\sum_s y_s N_s^{cb} = \sum_s (x_s - y_s)\, N_s^{cb} \qquad (34)$$

where $N_s^{cb}$ is the number of carbon-carbon bonds in species s. The net gain or loss of carbon-carbon bonds may be allowed by writing this constraint as an inequality.

For stoichiometries involving straight chain acyclic molecules in which the carbon skeleton is uninterrupted, the following prevents any change in the carbon skeleton (and permits only one carbon containing reactant and no carbon containing co-products):

�9 �9 c b Npb ~c~N~ = , Vcs C C S (35)


where $N_p^{cb}$ is the number of carbon-carbon bonds in the product. Clearly, this is an extremely restrictive constraint. Less restrictive constraints can be written for the same type of chemistry. For example, the formation of carbon-carbon bonds for stoichiometries involving straight chain acyclic molecules with uninterrupted carbon chains can be prevented by considering the relationship between the stoichiometric coefficients of the reactants and the product, if only one carbon containing reactant is allowed. Consider the production of a product with a single carbon-carbon bond from reactants containing up to six such bonds. Allowing only a single carbon containing reactant and disallowing the formation of carbon-carbon bonds, the following reaction schemes are permitted:

C-C → C-C

C-C-C → C-C + C

C-C-C-C → 2(C-C)

C-C-C-C-C → 2(C-C) + C

C-C-C-C-C-C → 3(C-C)

C-C-C-C-C-C-C → 3(C-C) + C

while schemes such as:

2(C) → C-C

2(C-C-C) → 3(C-C)

2(C-C-C-C-C) → 5(C-C)

2(C-C-C-C-C-C-C) → 7(C-C)

are not.

These reaction schemes imply that in an allowable stoichiometry, the following relationships between $v_p$ and $v_{cs}$ must be obeyed if species cs is selected as a reactant, according to the ratio of the number of carbon-carbon bonds in the reactant to that in the product:

If $0 \le N_{cs}^{cb} / N_p^{cb} < 1$ then $v_{cs} = 0$

If $1 \le N_{cs}^{cb} / N_p^{cb} \le 2$ then $v_p = -v_{cs}$

If $3 \le N_{cs}^{cb} / N_p^{cb} \le 4$ then $v_p = -2 v_{cs}$

If $5 \le N_{cs}^{cb} / N_p^{cb} \le 6$ then $v_p = -3 v_{cs}$
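These relationships can be checked against the reaction schemes listed earlier with a short Python sketch. It is restricted to the case considered in the text, a product with a single carbon-carbon bond, so that the bond ratio is simply the number of carbon-carbon bonds in the reactant; the function names are illustrative assumptions, and y denotes the magnitude of the (negative) reactant coefficient.

```python
def multiplier(n_cs_bonds):
    """Return t with 2t - 1 <= bond ratio <= 2t, or None if the ratio is below one."""
    if n_cs_bonds < 1:
        return None
    return (n_cs_bonds + 1) // 2


def scheme_allowed(n_cs_bonds, y_cs, v_p):
    """True if v_p = t * y_cs, i.e. v_p = -t * v_cs, for a reactant with
    n_cs_bonds carbon-carbon bonds and coefficient magnitude y_cs."""
    t = multiplier(n_cs_bonds)
    return t is not None and v_p == t * y_cs


# (bonds in reactant, reactant coefficient magnitude, product coefficient)
permitted = [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 1, 2), (5, 1, 3), (6, 1, 3)]
forbidden = [(0, 2, 1), (2, 2, 3), (4, 2, 5), (6, 2, 7)]

permitted_ok = [scheme_allowed(*s) for s in permitted]
forbidden_ok = [scheme_allowed(*s) for s in forbidden]
```

All six permitted schemes satisfy the relationship for their bond ratio, whereas each of the four disallowed schemes, which would require new carbon-carbon bonds to be formed, violates it.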

Since the nature of the relationship between $v_p$ and $v_{cs}$ depends on the bond ratio, the relationships are in fact quite general and can be applied to acyclic products with uninterrupted carbon chains featuring any number of carbon-carbon bonds. In addition, they can be extended for reactants with any number of such bonds. Clearly, a problem arises if $N_p^{cb} = 0$; however, this can be overcome by introducing the binary variable p, which takes the value zero when $N_p^{cb} \ge 1$ and unity when $N_p^{cb} = 0$ according to:

$$1 - p \le N_p^{cb} \le (1 - p)\, N_{max}^{cb} \qquad (36)$$

where $N_{max}^{cb}$ is the maximum number of carbon-carbon bonds featured in the species from the set CS. Incorporating p, the $v_p$ to $v_{cs}$ relationships are embodied in the following general constraints:

$$\sum_t (2t - 1)\, q_{t,cs} \le \left[ \frac{N_{cs}^{cb}}{N_p^{cb} + p} \right] \le 2 \sum_t t\, q_{t,cs} + 0.99\, z_{cs}, \quad \forall cs \in CS \qquad (37)$$

$$v_p (1 - p) - (1 - ii_{cs})\, v_{max} \le \Big( z_{cs} + \sum_t t\, q_{t,cs} \Big)\, y_{cs} \le (1 - z_{cs})\, v_p + p\, v_{max}, \quad \forall cs \in CS \qquad (38)$$

where $q_{t,cs}$ and $z_{cs}$ are integer variables such that:

$$z_{cs} + \sum_t q_{t,cs} = 1, \quad \forall cs \in CS \qquad (39)$$

so that for each species cs only one of $z_{cs}$ and the vector of $q_{t,cs}$ binary variables can take the value unity. Note that $y_{cs}$ is used in equation 38 so that the constraints affect only reacting species.

To understand how these constraints function, consider the following:

• When $N_p^{cb} = 0$ and $N_{cs}^{cb} = 0$, p = 1 from equation 36, and in order that equation 37 be obeyed $z_{cs} = 1$ and all $q_{t,cs} = 0$. Thus, since $y_{cs}$ is a positive variable, whether $ii_{cs}$ is zero or one, $0 \le y_{cs} \le v_{max}$ from equation 38, which imposes no additional restriction on $y_{cs}$.

• When $N_p^{cb} = 0$ and $N_{cs}^{cb} \ge 1$, p = 1 from equation 36, and in order that equation 37 be obeyed $z_{cs} = 0$ and $q_{t',cs} = 1$ (and all $q_{t \ne t',cs} = 0$). Thus, whether $ii_{cs}$ is zero or one, $0 \le t' y_{cs} \le v_p + v_{max}$. This potentially restricts $y_{cs}$, but since chemistries in which products with no carbon-carbon bonds are developed by breaking up reactants with such bonds are not of interest here, this limitation is acceptable.

• When $N_p^{cb} \ge 1$ and $N_{cs}^{cb} < N_p^{cb}$, p = 0 from equation 36, and in order that equation 37 be obeyed $z_{cs} = 1$ and all $q_{t,cs} = 0$. Thus, $y_{cs} \le 0$ from the upper bound in equation 38, so that $ii_{cs}$ must equal zero for the lower bound to be feasible. This means that all species with fewer carbon-carbon bonds than the product are not permitted as reactants.

• When $N_p^{cb} \ge 1$ and $N_{cs}^{cb} \ge N_p^{cb}$, p = 0 from equation 36, and in order that equation 37 be obeyed $z_{cs} = 0$ and $q_{t',cs} = 1$ (and all $q_{t \ne t',cs} = 0$). Thus if $ii_{cs} = 0$, $0 \le t' y_{cs} \le v_p$, or if $ii_{cs} = 1$, $v_p \le t' y_{cs} \le v_p$. This means that if species cs is not a reactant, there is no additional limitation on $y_{cs}$, but if species cs is a reactant, $y_{cs}$ and therefore $v_{cs}$ must be related to $v_p$ in one of the ways prescribed above, depending on the bond ratio.


Note that once the product is known, equation 36 can be solved for $p$, so that equations 37 and 39 can be solved for $z_{cs}$ and $q_{t,cs}$. $p$, $z_{cs}$ and $q_{t,cs}$ can then be entered as parameters in equation 38, which becomes linear as a result.
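The case analysis above can be checked numerically. The following is a minimal sketch, assuming equation 38 has the form lower <= (z + sum of t*q) * y <= upper as implied by the four cases; the function name `eq38_bounds` and its argument layout are hypothetical and not part of the original formulation.

```python
def eq38_bounds(v_p, v_max, p, z_cs, q_cs, ii_cs):
    """Return (lower, multiplier, upper) so that equation 38 reads
    lower <= multiplier * y_cs <= upper.

    q_cs maps each integer multiple t to its binary value q_{t,cs}.
    All names are illustrative assumptions.
    """
    # Equation 39: exactly one of z_cs and the q_{t,cs} entries is unity.
    assert z_cs + sum(q_cs.values()) == 1
    lower = v_p * (1 - p) - (1 - ii_cs) * v_max
    multiplier = z_cs + sum(t * q for t, q in q_cs.items())
    upper = (1 - z_cs) * v_p + p * v_max
    return lower, multiplier, upper


# Fourth case, species cs is a reactant: 2*y_cs is pinned to v_p.
print(eq38_bounds(v_p=2, v_max=4, p=0, z_cs=0, q_cs={1: 0, 2: 1}, ii_cs=1))
# Third case: the upper bound is zero, so cs cannot act as a reactant.
print(eq38_bounds(v_p=2, v_max=4, p=0, z_cs=1, q_cs={1: 0, 2: 0}, ii_cs=0))
```

With $p$, $z_{cs}$ and $q_{t,cs}$ fixed in this way, both bounds are constants and the constraint is linear in $y_{cs}$.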

7.4.6 Thermodynamic and Environmental Property Equations

The enthalpy of formation and Gibbs Free Energy of formation of each species are required to estimate the enthalpy and Gibbs Free Energy of reaction in the process model. Pure component heat capacities are required for the energy balance, and pure component toxicity is also needed.

ENTHALPY OF FORMATION According to Perry and Green (1984), the enthalpy of formation at 298K, $\Delta H_{fs}^{298}$, of species s in kJ/mol can be found using the group contribution scheme of Verma-Doraiswamy:

$$\Delta H_{fs}^{298} = 4.1868 \sum_j n_{sj}\, \delta H_j^{298}, \quad \forall s \in S \qquad (40)$$

where $\delta H_j^{298}$ is the contribution of group j (in kcal/mol) from Perry and Green (1984).

GIBBS FREE ENERGY OF FORMATION Also from Perry and Green (1984), the Gibbs Free Energy of formation $\Delta G_{fs}(T_{oper})$ of species s in kJ/mol can be estimated using the group contribution techniques of Van Krevelen and Chermin (1951) with accuracy of ±21 kJ/mol:

$$\Delta G_{fs}(T_{oper}) = 4.1868 \left\{ \sum_j n_{sj} \alpha_j^F + \left[ \sum_j n_{sj} \beta_j^F + R \ln \left( \frac{\sigma_s}{\eta_s} \right) \right] T_{oper} \right\}, \quad \forall s \in S \qquad (41)$$

where $T_{oper}$ is the reactor operating temperature, $\alpha_j^F$ and $\beta_j^F$ are the group contributions of group j in kcal/mol (from Perry and Green, 1984), R is the gas constant, $\sigma_s$ is the symmetry number of the molecule (the number of independent orientations which appear identical to an observer) and $\eta_s$ is the number of optical isomers. For molecules with no symmetrical orientations or optical isomers, $\sigma_s$ and $\eta_s$ are assigned the value of unity to avoid numerical problems.

$\alpha_j^F$ and $\beta_j^F$ are valid for temperatures between 300K and 1500K, so that $T_{oper}$ is bounded as follows:

$$300 \leq T_{oper} \leq 1500 \qquad (42)$$

HEAT CAPACITY AND TOXICITY The ideal gas molar heat capacity $C_{p,s}^{vap}$ (J/mol K) of species s is estimated using the following polynomial equation:

$$C_{p,s}^{vap} = \left( \sum_j n_{sj} A_a^j - 37.93 \right) + \left( \sum_j n_{sj} A_b^j + 0.21 \right) T + \left( \sum_j n_{sj} A_c^j - 3.91 \times 10^{-4} \right) T^2 + \left( \sum_j n_{sj} A_d^j + 2.06 \times 10^{-7} \right) T^3, \quad \forall s \in S \qquad (43)$$

where the coefficients $A_a^j$, $A_b^j$, $A_c^j$ and $A_d^j$ are group contribution parameters from Joback and Stephanopoulos (1989).
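Equation 43 can be evaluated directly once the group counts $n_{sj}$ are known. The sketch below is illustrative only: the function name `joback_cp` and the dictionary layout are assumptions, and the two parameter rows are quoted from memory of the Joback table, so they should be checked against the published values before use.

```python
# Illustrative group parameters (A_a, A_b, A_c, A_d); verify against the
# published Joback table before relying on them.
ILLUSTRATIVE_PARAMS = {
    "CH3": (19.5, -8.08e-3, 1.53e-4, -9.67e-8),
    "CH2": (-0.909, 9.50e-2, -5.44e-5, 1.19e-8),
}


def joback_cp(group_counts, T, params=ILLUSTRATIVE_PARAMS):
    """Ideal gas heat capacity in J/(mol K) from equation 43.

    group_counts maps a group name j to its count n_sj in the molecule.
    """
    sums = [
        sum(n * params[g][k] for g, n in group_counts.items())
        for k in range(4)
    ]
    return (
        (sums[0] - 37.93)
        + (sums[1] + 0.21) * T
        + (sums[2] - 3.91e-4) * T**2
        + (sums[3] + 2.06e-7) * T**3
    )


# Propane described as two CH3 groups and one CH2 group.
print(joback_cp({"CH3": 2, "CH2": 1}, 298.0))
```

Because the group sums enter each coefficient linearly, the polynomial can be integrated in closed form when the energy balance integrals are evaluated.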

In order to measure the short-term environmental impact of any material released, the toxicity of each species s is estimated using the group contribution techniques of Gao et al. (1992):

$$-\log LC50_s = \sum_j \alpha_j n_{sj}, \quad \forall s \in S \qquad (44)$$

7.4.7 Reactor Process Model Equations

The process model is based on a single reactor in which chemical equilibrium is achieved. Unreacted raw materials are recycled assuming that they are separated cleanly from the products and ignoring, for the present, the necessary separation technology. The chemical equilibrium position is located by minimising system Gibbs Free Energy, and since this position is independent of the reactor, it is not necessary to prepostulate the reactor type (e.g. PFR, CSTR etc.).

COMPONENT MASS BALANCES The mass balance around the reactor is written as follows for all species in the set S:

$$n_{sf} - n_{si} = \epsilon_r v_s, \quad \forall s \in S \qquad (45)$$

where $n_{si}$ is the total molar feed flow of component s to the reactor, $n_{sf}$ is the molar flow of component s in the reactor effluent and $\epsilon_r$ is the extent of reaction. Recalling that $x_s$ is non-zero for all species involved in the reaction and $y_s$ is non-zero only for the reactants, and taking a flow basis of 10 kmol/hr, the following restrictions are written for $n_{si}$ and $n_{sf}$:

$$n_{si} = 10 y_s, \quad \forall s \in S \qquad (46)$$

$$n_{sf} \leq 10 x_s, \quad \forall s \in S \qquad (47)$$

According to equation 46, $n_{si}$ is non-zero only for the reactants in a particular stoichiometry, and according to equation 47, $n_{sf}$ can be non-zero only for the components involved in a particular stoichiometry. For all other components in the set S, both $n_{si}$ and $n_{sf}$ are zero. To ensure all reactions exhibit acceptable conversion, the extent is bounded as follows:

$$\epsilon_r \geq \epsilon_r^{lo} \qquad (48)$$


For the reactants, the product $-\epsilon_r v_s$ represents the consumption rate in the reactor and therefore the fresh feed demand. For the products, $\epsilon_r v_s$ represents the production rate. Thus, the fresh feed demand and production rate in kmol/hr are given by:

$$F_s = -\epsilon_r v_s ii_s, \quad \forall s \in S \qquad (49)$$

$$P_s = \epsilon_r v_s i_s, \quad \forall s \in S \qquad (50)$$
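A small worked example may help fix the bookkeeping in equations 45, 46, 49 and 50. The sketch below assumes $y_s$ equals the magnitude of the stoichiometric coefficient for reactants and zero otherwise; the function name `reactor_flows` and the argument names are hypothetical.

```python
def reactor_flows(v, y, ii, i, extent, basis=10.0):
    """Feed, effluent, fresh feed demand and production rate in kmol/hr.

    v      : stoichiometric coefficients (negative for reactants)
    y      : reactant coefficient magnitudes (zero for non-reactants)
    ii, i  : reactant and product flags (0 or 1)
    extent : extent of reaction in kmol/hr
    """
    n_in = [basis * ys for ys in y]                          # equation 46
    n_out = [ni + extent * vs for ni, vs in zip(n_in, v)]    # equation 45
    fresh = [-extent * vs * f for vs, f in zip(v, ii)]       # equation 49
    produced = [extent * vs * f for vs, f in zip(v, i)]      # equation 50
    return n_in, n_out, fresh, produced


# A + B -> C with an extent of 4 kmol/hr on the 10 kmol/hr basis.
n_in, n_out, fresh, produced = reactor_flows(
    v=[-1, -1, 1], y=[1, 1, 0], ii=[1, 1, 0], i=[0, 0, 1], extent=4.0
)
print(n_in, n_out, fresh, produced)
```

In this example 6 kmol/hr of each reactant leaves the reactor unconverted and is recycled, so the fresh feed demand equals the 4 kmol/hr consumed.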

ENERGY BALANCE It is assumed that the fresh feed enters the reaction block at 298K and the products leave the block at the reactor operating temperature $T_{oper}$. The reactor energy demand therefore has two contributions, one from fresh feed pre-heat and one from the heat of reaction. The heat of reaction is estimated in three steps. First, the total feed (comprising fresh feed and recycled reactants) is cooled from the operating temperature to 298K, the reaction is then performed and finally the entire reactor effluent is reheated to the operating temperature.

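The reactor energy demand $Q_{reactor}$, in kJ/hr per mole of product is given by:

$$Q_{reactor} P_p = \sum_s (-\epsilon_r v_s ii_s) \int_{298}^{T_{oper}} C_{p,s}^{vap}\, dT + \sum_s n_{si} \int_{T_{oper}}^{298} C_{p,s}^{vap}\, dT + 1000\, \epsilon_r \Delta H_R^{298} + \sum_s n_{sf} \int_{298}^{T_{oper}} C_{p,s}^{vap}\, dT$$

where $P_p$ is the production rate of desired product and $C_{p,s}^{vap}$ is the vapour heat capacity of component s. Substituting for $n_{sf}$ from equation 45, this reduces to:

$$Q_{reactor} = \frac{1}{P_p} \left\{ 1000\, \epsilon_r \Delta H_R^{298} + \sum_s \epsilon_r v_s (1 - ii_s) \int_{298}^{T_{oper}} C_{p,s}^{vap}\, dT \right\} \qquad (51)$$

The heat of reaction at 298K, $\Delta H_R^{298}$ in kJ/mol, is estimated from:

$$\Delta H_R^{298} = \sum_s v_s \Delta H_{fs}^{298} \qquad (52)$$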

SYSTEM GIBBS FREE ENERGY The Gibbs Free Energy of the reaction system $G_{sys}$ (in kJ/hr since $n_{sf}$ is a molar flow) may be estimated from the following expression (for perfect gases or real gases at low pressure):

$$\frac{G_{sys}}{1000} = \sum_s n_{sf} \Delta G_{fs} + R T_{oper} \sum_s n_{sf} \ln \left[ \frac{n_{sf} P_{oper}}{\sum_s n_{sf} P^{\theta}} \right]$$

Differentiating this expression with respect to $n_{sf}$ at constant temperature and pressure leads to the condition for chemical equilibrium:

$$\Delta G_R = -R T_{oper} \ln K \qquad (53)$$

where R is the gas constant, and the Gibbs Free Energy change of reaction per mole of product $\Delta G_R$ in kJ/mol and the reaction equilibrium constant K are given by:

$$\Delta G_R = \sum_s v_s \Delta G_{fs}$$

$$K = \prod_s \left[ \frac{n_{sf} P_{oper}}{\sum_s n_{sf} P^{\theta}} \right]^{v_s} \qquad (54)$$

In order that the chemical equilibrium condition be obeyed, a reaction with a large positive or negative $\Delta G_R$ must exhibit a very small or a very large K respectively. A very small K implies very small $n_{sf}$ values for the products, while a very large K implies very small $n_{sf}$ values for the reactants. In some cases, these $n_{sf}$ values are so close to zero that the optimisation problem becomes poorly scaled. Introducing an $n_{sf}$ lower bound does not solve the problem since any such bound may render the equilibrium condition infeasible.

Thus, rather than imposing the chemical equilibrium condition, $G_{sys}$ is minimised directly instead, with an $n_{sf}$ bound in place to prevent scaling problems. While this prevents reactions with large positive or negative $\Delta G_R$ from achieving equilibrium, since the $n_{sf}$ lower bound is small, the solutions are barely affected.

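The $n_{sf}$ bound is written as follows, with the binary flag $iii_s$ introduced so that $n_{sf}$ takes the value zero for all species not involved in the stoichiometry of immediate interest:

$$n_{sf} \geq 1 \times 10^{-4} (1 - iii_s), \quad \forall s \in S \qquad (55)$$

Since $n_{sf}$ is zero for some s, $iii_s$ must also be introduced into the $G_{sys}$ expression:

$$\frac{G_{sys}}{1000} = \sum_s n_{sf} \Delta G_{fs} + R T_{oper} \left[ \sum_s n_{sf} \ln \left( (n_{sf} + iii_s) P_{oper} \right) - \sum_s n_{sf} \ln \left( \sum_s n_{sf} P^{\theta} \right) \right] \qquad (56)$$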
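The role of the flag $iii_s$ in equation 56 is easiest to see in code: a species outside the stoichiometry has zero flow and a flag of one, so its logarithm is finite and its contribution vanishes. The following sketch assumes R is expressed in kJ/(mol K) and that the standard pressure is 1 in the units of the operating pressure; the function name `gsys_over_1000` is hypothetical.

```python
import math

R_KJ = 8.314e-3  # gas constant in kJ/(mol K), an assumed unit choice


def gsys_over_1000(n_sf, dG_f, iii, T_oper, P_oper, P_std=1.0):
    """Right-hand side of equation 56.

    n_sf : effluent molar flows (kmol/hr)
    dG_f : Gibbs free energies of formation at T_oper (kJ/mol)
    iii  : 1 for species not involved in the stoichiometry, else 0
    """
    n_total = sum(n_sf)
    formation = sum(n * g for n, g in zip(n_sf, dG_f))
    # iii shifts the log argument away from zero for uninvolved species.
    mixing = sum(
        n * math.log((n + flag) * P_oper) for n, flag in zip(n_sf, iii)
    )
    mixing -= sum(n * math.log(n_total * P_std) for n in n_sf)
    return formation + R_KJ * T_oper * mixing


# Two involved species plus one uninvolved species with zero flow.
print(gsys_over_1000([6.0, 4.0, 0.0], [-50.0, -120.0, 30.0], [0, 0, 1],
                     T_oper=500.0, P_oper=1.0))
```

An involved species must still respect the lower bound of equation 55, otherwise the logarithm of a zero flow would be undefined.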

Crabtree and El-Halwagi (1994) use a similar approach to deal with species not involved in a particular stoichiometry, although they reported using the reaction equilibrium condition to determine the reaction equilibrium position.

Note that the chemical equilibrium condition provides a unique relationship between $n_{sf}$ and $T_{oper}$, whereas in the $G_{sys}$ expression they are independent variables. Thus an additional temperature bound is required to ensure that temperature is consistent with the extent bound from equation 48. This bound is calculated by solving the following equation for $T'$:

$$\Delta G_R(T') = -R T' \ln K_{\epsilon^{lo}} \qquad (57)$$

where $K_{\epsilon^{lo}}$ is the equilibrium constant evaluated at $\epsilon_r = \epsilon_r^{lo}$:

$$\ln K_{\epsilon^{lo}} = \sum_s v_s \ln \left( (n_{si} + v_s \epsilon_r^{lo}) P_{oper} \right) - \sum_s v_s \ln \left( \sum_s (n_{si} + v_s \epsilon_r^{lo}) P^{\theta} \right) \qquad (58)$$

The reactor operating temperature must then satisfy:

$$T_{oper} \geq T' \qquad (59)$$

THERMODYNAMIC AND ECONOMIC CONSTRAINTS The Gibbs Free Energy of reaction per mole of product is employed to eliminate thermodynamically infeasible solutions, using a 10 kcal/mol (or 41.868 kJ/mol) upper limit as follows:

$$\frac{\Delta G_R}{v_p} \leq 41.868 \qquad (60)$$

The profit associated with each reaction is calculated as follows, assuming that any stoichiometric co-products are sold at their market value:

$$\text{Profit} = \frac{\sum_s v_s C_s}{v_p} \qquad (61)$$

where $C_s$ is the market value of species s using Chemical Prices (1998). Note that individual reaction steps cannot be rejected on the basis of profit since the profit of any one step is not representative of the profit of the entire chemistry.
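Equations 60 and 61 amount to a screening test and a weighted sum. A minimal sketch follows; the function names and the price figures in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def thermodynamically_acceptable(dG_R, v_p, limit=41.868):
    """Equation 60: Gibbs free energy of reaction per mole of product
    must not exceed the limit (kJ/mol)."""
    return dG_R / v_p <= limit


def reaction_profit(v, prices, v_p):
    """Equation 61: profit per mole of product, with reactants
    (negative v_s) as costs and products (positive v_s) as revenue."""
    return sum(vs * cs for vs, cs in zip(v, prices)) / v_p


# A + B -> C + D with made-up prices in $/mol.
print(reaction_profit([-1, -1, 1, 1], [1.0, 0.5, 2.5, 0.25], v_p=1))
print(thermodynamically_acceptable(50.0, v_p=2))
```

A negative profit for one step is not grounds for rejection here, since only the cascaded total over all steps is meaningful.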

ENVIRONMENTAL CONSIDERATIONS The environmental impact directly associated with carrying out each stoichiometry is assumed to arise only from the energy consumption necessary to maintain reactor temperature. By-products are not considered and it is assumed that there are no material emissions of the co-products of any stoichiometry. In addition, the impacts associated with separating the products from the recycle, and with all other downstream processing, are ignored for the present. In these respects, the impact figures calculated here are very much lower bounds for the eventual process impacts. This simplistic treatment of environmental impact assessment reflects the level of information available at stoichiometry selection.

In principle, the full range of life cycle assessment (LCA) based metrics available within the MEIM could be used to develop a full impact vector for each stoichiometry. However, since air emissions are the dominant form of energy associated waste, the critical air mass (CTAM) metric is chosen. According to Stefanis (1996), the critical air mass associated with energy production is $1.629 \times 10^{8}$ kg air/MWh. Assuming that the environmental impact per unit energy of maintaining reactor temperature is the same as that of burning fossil fuels to produce electricity, the environmental impact arising from the reactor energy demand per mole of desired product is:

CTAME(kgair/hr)=l'629• V/(Qr~act~ )3600 (62)

Note that it is assumed here for simplicity that the impact of cooling the reactor (which is necessary for negative $Q_{reactor}$) is the same as that of heating it. This is a simplistic assumption; however, in this way reactions which require withdrawal of energy are equally penalised in environmental terms as those which require energy supply.

This assumption is made on the basis that reactions which require energy withdrawal are likely to be exothermic reactions occurring at moderate temperatures, so that the heat of reaction term dominates the energy balance (equation 51). This is only likely to occur towards the reaction temperature lower bound (i.e. 300K) at which the reactor temperature is too low to use cooling water at ambient temperature. Thus, some kind of refrigeration would be required which carries with it a high energy demand and therefore a high impact, associated with compression requirements.

In order to complete the impact assessment, the input wastes associated with the materials consumed in any stoichiometry must be included. However, the quantification of the input waste of any material can be a lengthy exercise, since all processing steps necessary to produce the material from naturally occurring substances must be considered in accordance with the principles of LCA (Heijungs et al., 1992; ISO 14040, 1997; SETAC, 1993). Thus, rather than performing this exercise for all co-materials, it is more efficient to assess the input wastes only of those materials which are identified as raw materials by the multi-step stoichiometry identification formulation (i.e. those materials with no precursors). Since these materials are not known at the outset, input waste assessment can only be performed after stoichiometry identification.

Provided input wastes are included in this way, consistent impact figures can be obtained for multi-step stoichiometries involving branches of different lengths, and different stoichiometries can be compared on a consistent basis.

7.5 SOLVING THE MULTISTEP STOICHIOMETRY IDENTIFICATION PROBLEM

7.5.1 Overview

It is desirable to use the above model to enumerate and evaluate multistep stoichiometries simultaneously and implicitly within a framework of constrained optimisation. However, the optimisation objective (minimise $G_{sys}$) is not suitable for such an approach since it must be applied to each individual stoichiometric step and furthermore, even for a single step stoichiometry, the model is a large mixed integer nonlinear programming (MINLP) problem.

It involves a large number of optimisation variables, including for each species s: the binary variables $i_s$, $ii_s$, $iii_s$, $b_{ns}$ ($\forall n$) and also the binaries $q_{t,cs}$ ($\forall t$), $z_{cs}$ and $p$ if the carbon structure constraints are employed, and the continuous variables $v_s$, $x_s$, $y_s$, $n_{sf}$, $n_{si}$, $F_s$, $P_s$ and $\Delta G_{fs}$. Furthermore, the relationships between these variables are not trivial, including many instances of products between binary and continuous variables.

Thus, in order to solve the multistep stoichiometry identification problem, a decomposition based approach is adopted in which the single step problem is solved by explicit enumeration and subsequent evaluation of stoichiometries in two sequential steps. This procedure is then applied successively in an algorithm designed to build up multistep reaction stoichiometries. The enumeration and evaluation of single step stoichiometries are discussed below, followed by a description of the multistep stoichiometry identification solution algorithm.

SINGLE STEP STOICHIOMETRY ENUMERATION The basic single step stoichiometry enumeration formulation consists of equations 18 - 33. Carbon structure constraints (equations 34 - 39) are optional and are included on a case by case basis. With the exception of the more complicated carbon structure constraints (equations 37 and 38) all equations are linear. However, as discussed above, provided equations 36, 37 and 39 are solved in advance, equation 38 becomes linear. Thus, with or without carbon structure constraints, the single step stoichiometry enumeration problem can be formulated as a mixed integer linear programming (MILP) problem so that optimal solutions can be guaranteed.

Recognising that simple stoichiometries with few reactants and co-products are more attractive than complex ones (which in general require more complex reaction and separation technologies), a new objective function is introduced in order to extract stoichiometries systematically from the matrix $e_s$, starting with the simplest first. The number of materials involved in any stoichiometry $N_{spe}$ is obtained by summing the reactant and co-product flag values, so that this objective is written as follows:

$$\text{minimise} \left\{ \sum_s (i_s + ii_s) = N_{spe} \right\} \qquad (63)$$

In order to identify a set of candidate stoichiometries this problem must be solved repeatedly, with integer cuts introduced at each iteration to exclude previous solutions. Accordingly, the simplest stoichiometries are enumerated first and as cuts are added the solutions become progressively more complicated.
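The repeated solve with integer cuts can be illustrated without an MILP solver. The sketch below is a simplification under stated assumptions: a single binary vector stands in for the reactant and co-product flags, the toy `feasible` predicate replaces equations 18 - 33, and the minimisation is done by brute force. The cut itself is the standard one that excludes exactly one previous binary solution.

```python
from itertools import product


def violates_cut(x, previous):
    """True if x is excluded by the integer cut written for `previous`:
    sum_{s in ones} x_s - sum_{s in zeros} x_s <= |ones| - 1."""
    ones = [s for s, value in enumerate(previous) if value == 1]
    zeros = [s for s, value in enumerate(previous) if value == 0]
    lhs = sum(x[s] for s in ones) - sum(x[s] for s in zeros)
    return lhs > len(ones) - 1


def enumerate_by_cuts(n_species, feasible, max_solutions):
    """Repeatedly minimise the number of active flags, adding one cut
    per solution so that earlier solutions cannot reappear."""
    cuts, found = [], []
    while len(found) < max_solutions:
        candidates = [
            x for x in product((0, 1), repeat=n_species)
            if feasible(x) and not any(violates_cut(x, c) for c in cuts)
        ]
        if not candidates:      # infeasible: everything is enumerated
            break
        best = min(candidates, key=sum)
        found.append(best)
        cuts.append(best)
    return found


# Toy rule: species 0 must take part, with at least one other species.
print(enumerate_by_cuts(3, lambda x: x[0] == 1 and sum(x) >= 2, 10))
```

The solutions come out in order of increasing flag count, mirroring the way the simplest stoichiometries are extracted first.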

SINGLE STEP STOICHIOMETRY EVALUATION The single step stoichiometry evaluation problem consists of the property prediction and reactor process model, equations 40 - 62. For each stoichiometry, this model is solved immediately after the stoichiometry enumeration model. Thus, $v_s$, $x_s$, $y_s$, $i_s$, $ii_s$ and $iii_s$ are known and are treated as parameters in the stoichiometry evaluation problem, which is then reduced to a nonlinear programming (NLP) problem. The optimisation objective is to minimise $G_{sys}$ and the main optimisation variables are $T_{oper}$, $\epsilon_r$ and $n_{sf}$.

7.5.2 Multistep Stoichiometry Identification Algorithm

OVERVIEW OF ALGORITHM In order to generate multistep stoichiometries the single step stoichiometry enumeration and evaluation problems are solved successively using a depth first enumeration strategy, in which the desired maximum number of reaction steps is specified in advance. The operation of the algorithm is schematically depicted in Figure 1 for the case where at most three reaction steps are allowed.

[Figure: block diagram of the systems (including Systems 2A, 2B, 2C and 2D) and their evaluation blocks]

Figure 1: Multistep Stoichiometry Identification Algorithm

At the first level, system zero, the final desired product is the target molecule, and a single stoichiometry involving up to two first generation precursor reactants is extracted from the matrix $e_s$. One of these first generation precursors is then arbitrarily selected as the target molecule for system 1A and a single stoichiometry involving up to two second generation precursor reactants is identified which leads to this compound. One of these second generation precursors is then selected as the target molecule for system 2A. Since system 2A completes this branch of the enumeration exercise, it is solved iteratively until all stoichiometries leading to this second generation precursor target molecule have been enumerated. Once this has been achieved, system 2B is solved iteratively for all stoichiometries leading to the other second generation precursor.

With this pair of second generation precursors completely fathomed, System 1A is run again to generate two more, which are then treated as the target molecules for system 2A and system 2B. This process is repeated until all stoichiometries leading to the first generation precursor target molecule have been enumerated. Once this has been achieved, systems 1B, 2C and 2D are employed to enumerate all stoichiometries leading to the other first generation precursor. System zero is then solved again to generate two more first generation precursors, and the whole procedure is repeated until all stoichiometries leading to the final desired product have been enumerated.

In this way, multistep reaction stoichiometries are developed in which a family tree of precursors are linked to each other by individual reaction stoichiometries which lead eventually to the desired product. In principle, this approach may be applied to generate any number of successive reaction steps.

Each system comprises both stoichiometry enumeration and evaluation, so that for each stoichiometry, the evaluation problem is solved immediately after the stoichiometry is generated. Infeasibility in the linear stoichiometry enumeration problem implies that no stoichiometries exist which lead to the particular target molecule, while infeasibility in the non-linear stoichiometry evaluation model implies violation of the thermodynamic constraints (ignoring numerical problems) for a particular stoichiometry. Thus, if the stoichiometry enumeration model is initially infeasible or becomes infeasible after all possible stoichiometries in a particular system have been enumerated, the algorithm immediately moves on to the next system. If, however, the stoichiometry evaluation model is infeasible for a particular stoichiometry, the algorithm continues to enumerate stoichiometries within the same system. The results of the system are stored only if both problems are feasible.

The same occurrence matrix $e_s$ and variables are used throughout all systems, so that each time a system is solved all variables are overwritten. Thus, parameters are employed to store results and to communicate variable values between stoichiometry enumeration and evaluation problems within each system. Different role specification constraints, chemistry or carbon structure constraints may be included in different systems, if required, simply by including different equations in the model definitions. Otherwise, the same equations and models are also employed throughout all systems.
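The depth first strategy described above can be sketched as a short recursion. This is an illustration under stated assumptions, not the authors' implementation: the dictionary `steps` is a hypothetical stand-in for the single step enumeration problem (it lists, for each target, the reactant sets of its feasible stoichiometries), the evaluation step is omitted, and no integer cuts are applied, so circuits are avoided only through the depth limit.

```python
from itertools import product


def expand(target, steps, levels_left):
    """Return all route trees leading to `target`.

    A leaf is a species name (treated as a raw material); an internal
    node is (species, tuple_of_subtrees). A species is a leaf when the
    step limit is reached or when no stoichiometry leads to it, which
    corresponds to an infeasible enumeration problem.
    """
    options = steps.get(target, [])
    if levels_left == 0 or not options:
        return [target]
    routes = []
    for reactants in options:                 # one single step system
        sub_routes = [expand(r, steps, levels_left - 1) for r in reactants]
        # Every combination of subsequent stoichiometries is a route.
        for combo in product(*sub_routes):
            routes.append((target, combo))
    return routes


# Hypothetical network: P from A + B; A from C, or from D + E.
steps = {"P": [("A", "B")], "A": [("C",), ("D", "E")]}
for route in expand("P", steps, levels_left=2):
    print(route)
```

Each precursor branch is fathomed completely before the next one is started, which is the order in which systems 2A, 2B, 1B, 2C and 2D are visited in the description above.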


INTEGER CUTS Within the algorithm, integer cuts are automatically written each time a system is solved. The cuts are written in such a way that they prevent the same stoichiometry from occurring again both within the current system and within all subsequent systems. Furthermore, they are written to prevent the reappearance of the stoichiometry in both forward and reverse directions. In this way, each material which appears as a reactant in the entire multistep stoichiometry network is fathomed only once and circuits, in which a reaction appears in both forward and reverse directions within the same multi-step stoichiometry, are avoided.

The integer cuts are written in terms of the reactant and co-product flags in much the same way as the chemistry constraints, and since the same reactant and co-product flag variables are used for each system, it is a simple matter to write these constraints in such a way that once written, they are included in all subsequent systems.

TARGET MOLECULE IDENTIFICATION At the outset, only the desired product is known, so that a means of communicating target molecule identities between the subsequent systems is required. This is achieved by using the $ii_s$ values from each system as lower bounds for the stoichiometric coefficients in the next. If species k and l are the precursor reactants generated by a certain system, $ii_k$ and $ii_l$ will take the value unity for this system. These species must then be identified as the products of the pair of subsequent systems. However, species k and l must be considered independently, i.e. one in one system and one in the other. In order to do this, the vector $II_s$ must be split so that in one system $i_k = 1$ while $i_l$ (and all other product flags) are unconstrained, whereas in the other system $i_l = 1$ while $i_k$ (and all other product flags) are unconstrained. This is achieved by incorporating the following equations in all stoichiometry enumeration formulations:

$$ii_s = a_s + b_s, \quad \forall s \in S \qquad (64)$$

$$\sum_s a_s = 1 \qquad (65)$$

$$\sum_s b_s \leq 1 \qquad (66)$$

Equation 64 splits the vector $II_s$ from each system into two vectors $A_s$ and $B_s$, of which the elements $a_s$ and $b_s$ are binary variables. According to equations 65 and 66, $A_s$ and $B_s$ may have only one non-zero entry each, so that the non-zero entries in $II_s$ are divided, one into $A_s$ and one into $B_s$. The values of $A_s$ and $B_s$ are then communicated to the subsequent pair of systems through the parameter vectors $AP_s$ and $BP_s$ respectively, by replacing equation 19 with one of the following equations in all systems subsequent to system zero:

$$i_s \geq ap_s, \quad \forall s \in S \qquad (67)$$

$$i_s \geq bp_s, \quad \forall s \in S \qquad (68)$$

Note that equation 66 is written as an inequality to permit stoichiometries in which there is only one reactant (e.g. isomerisation or thermal decomposition). In such cases, all elements of the vector $B_s$ take the value zero and the algorithm omits the entire branch of corresponding subsequent systems.

Note also that $v_p$, the stoichiometric coefficient of the target molecule for each system subsequent to system zero, which is needed in the stoichiometry evaluation equations, is identified using one of the following equations:

$$v_p = \sum_s ap_s v_s \qquad (69)$$

$$v_p = \sum_s bp_s v_s \qquad (70)$$

where $ap_s$ and $bp_s$ are the parameter values generated by the previous system, and $v_s$ are the stoichiometric coefficients of the current system. For all systems, $v_p$ is included as a parameter in stoichiometry evaluation.
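The splitting of the reactant flag vector and the recovery of $v_p$ can be sketched in a few lines. In the formulation the split is made by the solver through equations 64 - 66; the helper below simply picks one valid split, and both function names are hypothetical.

```python
def split_reactant_flags(ii):
    """Split the reactant flag vector into A and B (equations 64 - 66):
    A carries exactly one non-zero entry, B at most one."""
    reactants = [s for s, flag in enumerate(ii) if flag == 1]
    if not 1 <= len(reactants) <= 2:
        raise ValueError("expected one or two precursor reactants")
    a = [0] * len(ii)
    b = [0] * len(ii)
    a[reactants[0]] = 1
    if len(reactants) == 2:
        b[reactants[1]] = 1
    return a, b


def target_coefficient(flags, v):
    """Equations 69 and 70: pick out the stoichiometric coefficient of
    the target molecule from the current system's coefficients."""
    return sum(f * vs for f, vs in zip(flags, v))


ap, bp = split_reactant_flags([0, 1, 0, 1])
print(ap, bp)
# Coefficients of the next system, whose target is species 1.
print(target_coefficient(ap, [-1, 2, 0, -3]))
```

When the flag vector has a single non-zero entry, B is all zeros, which is the signal for the algorithm to omit the corresponding branch.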

CALCULATION OF FINAL RESULTS The profit and impact from each system are stored as parameters immediately after the system is solved. These figures are calculated per mole of the target molecule produced in the current system. The total profit and impact associated with the multistep stoichiometries are calculated per mole of the final product, by starting at the final systems and working forward towards the final product, adding the profits and impacts sequentially.

However, the target molecule in a certain system may exhibit a stoichiometric coefficient with any value (subject to $v_{max}$) in the previous system where it appears as a reactant, and the product of this previous system may also exhibit any such stoichiometric coefficient value. Thus, the profit and impact of each system must be multiplied by the magnitude of the stoichiometric coefficient of its target molecule as it appears in the previous system, and divided by the magnitude of the stoichiometric coefficient of the product of this previous system, before being added to the profit or impact of the previous system. Since each system may have up to two immediate subsequent systems, the profits and impact of both subsequent systems must be treated in this way. In addition, each combination of subsequent system stoichiometries must be considered and a separate profit and impact figure calculated for each.

Depending on whether $ap_s$ or $bp_s$ is used in a particular system, $x_{p,k}^{k-1}$, the magnitude of the stoichiometric coefficient which the target molecule of system k exhibits as a reactant in system k - 1, is given by one of:

$$x_{p,k}^{k-1} = \sum_s ap_s^{k-1} x_s^{k-1} \qquad (71)$$

$$x_{p,k}^{k-1} = \sum_s bp_s^{k-1} x_s^{k-1} \qquad (72)$$

where $ap_s^{k-1}$ and $bp_s^{k-1}$ are parameter values from system k - 1 and $x_s^{k-1}$ are the stoichiometric coefficients of system k - 1. Thus, the profit and impact of system k are multiplied by $x_{p,k}^{k-1}$ and divided by $x_p^{k-1}$ before being added to the profit and impact of system k - 1.

Using these parameters, profits and impacts are cascaded through the multi- step stoichiometry network and a total profit and impact figure is arrived at for each set of stoichiometries which eventually leads to the final product.
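The cascade can be written as a short recursion over the tree of systems. The sketch below handles one combination of subsequent stoichiometries; the dictionary layout, the key names and the numbers in the example are hypothetical.

```python
def cascade(system):
    """Total (profit, impact) per mole of this system's product.

    system["x_p"]  : coefficient magnitude of the system's own product
    child["x_pk"]  : coefficient magnitude of the child's target as a
                     reactant in this system (equations 71 and 72)
    """
    total_profit = system["profit"]
    total_impact = system["impact"]
    for child in system.get("children", []):
        scale = child["x_pk"] / system["x_p"]
        child_profit, child_impact = cascade(child["system"])
        total_profit += scale * child_profit
        total_impact += scale * child_impact
    return total_profit, total_impact


# System 0 consumes 2 mol of a precursor per mol of product; the
# precursor is made in system 1 at a loss of 1.0 $/mol.
system_1 = {"profit": -1.0, "impact": 4.0, "x_p": 1.0}
system_0 = {
    "profit": 0.5, "impact": 10.0, "x_p": 1.0,
    "children": [{"x_pk": 2.0, "system": system_1}],
}
print(cascade(system_0))
```

Running the same recursion once per combination of subsequent stoichiometries yields the separate profit and impact figures described above.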

Note that each reactor is assumed to be fed at ambient conditions. It is assumed that any cooling which may be required between successive reaction steps to achieve this will be accommodated by energy integration at a later stage of the process design, with no environmental penalty.

7.6 APPLICATION

7.6.1 Case Study: Production of 1-Naphthalenyl Methyl Carbamate

1-naphthalenyl methyl carbamate, also known as carbaryl, was employed as a pesticide (Kalelkar, 1988; Shrivastava, 1987; Worthy, 1985). It was manufactured under the trade name SEVIN by Union Carbide India, Limited (UCIL) in Bhopal until December 1984, when production was terminated following the Bhopal disaster. UCIL's process involved the raw materials 1-naphthol and methyl isocyanate, a toxic substance with a permissible exposure limit (PEL) of 0.02ppm (AGCIH, 1977; Dagani, 1985). Under disputed circumstances, 45 tons of methyl isocyanate underwent a chemical reaction and were released, killing approximately 2,500 people in the vicinity of the plant and resulting in some 300,000 additional casualties.

Crabtree and El-Halwagi (1994) considered this example with the objective of identifying stoichiometries with more innocuous raw materials, to reduce the potential impact of fugitive emissions. The approach employed here is somewhat different in that the objective is to identify stoichiometries which exhibit low environmental impact under normal operating conditions. While materials with high toxicity can be excluded at the co-material design stage by including equation 17 in the co-material design formulation, this potentially excludes stoichiometries which could be environmentally promising provided proper containment were employed. Thus, no such limit was included in this example. In cases where fugitive emissions are of concern, the methodology for environmental risk assessment of non-routine industrial releases presented by Stefanis and Pistikopoulos (1997) could, in principle, be incorporated as part of stoichiometry evaluation.

GROUP PRE-SELECTION According to Worthy (1985), there are two accepted industrial routes to carbaryl, which can be produced with or without methyl isocyanate. The alternative chemistries are shown in Figure 2.

Methyl Isocyanate Route

CH3NH2 (Methyl Amine) + COCl2 (Phosgene) -> CH3-N=C=O (Methyl Isocyanate) + 2 HCl

1-Naphthol + CH3-N=C=O -> Carbaryl (1-Naphthalenyl Methyl Carbamate)

Non-Methyl Isocyanate Route

1-Naphthol + COCl2 -> 1-Naphthalenyl Chloroformate + HCl

1-Naphthalenyl Chloroformate + CH3NH2 -> Carbaryl + HCl

Figure 2: Carbaryl Production Routes

For simplicity, to limit the size of the co-material design problem in this illustrative example, the group set is restricted to the simplest set of groups which are required to form the product and industrial co-materials shown in Figure 2. The selected set of UNIFAC groups (eleven in all) then consists of the aromatic groups AC, ACH, ACCl and ACOH, and the groups -CH3, CH3NH-, CH3NH2, -COO-, -CHO, -OH and -Cl. Note that methyl amine (CH3NH2) appears as a class zero group in Constantinou (1996), that is as a complete molecule, so that the NH2 group is not required. Note that the -Cl group is a category two group.


CO-MATERIAL DESIGN Using this group set, the co-materials were then constructed by solving the co-material enumeration formulation once for acyclic molecules and once for aromatic molecules. Additional structural restrictions were included, according to the structures of the industrial co-materials: (i) for non-aromatic molecules an upper limit of two groups was imposed, (ii) for aromatic molecules an upper limit of twelve groups was imposed since it is unlikely that carbaryl (which contains twelve groups) would be synthesised from a more complex molecule, (iii) only unsubstituted or monosubstituted aromatics which contain the double ring (naphthyl group) aromatic structure were allowed (since the product contains the naphthyl group and is monosubstituted) by specifying a minimum of seven ACH groups, and a total of ten ACH and AC groups altogether, and (iv) only one substituent group with a carbon free attachment was allowed in the aromatics. In addition, all non-aromatic molecules containing carbon-carbon bonds were screened out after enumeration. Thus, chemistries in which the naphthyl group structure is constructed or decomposed, or in which any other carbon-carbon bonds are formed or broken, are avoided.

For the acyclic molecules, constraints were included to prevent chlorine bonding with itself or with any groups of higher category. However, for the aromatic molecules, these constraints were removed, to allow the formation of 1-naphthalenyl chloroformate. The results of co-material enumeration are shown in Figure 3.

1) Naphthalene
2) 1-Chloronaphthalene
3) 1-Naphthol
4) N-Methyl-1-Naphthylamine
5) 1-Naphthalenyl Hydroxyformate
6) 1-Naphthalenyl Chloroformate
7) Carbaryl
8) Chlorine (Cl2)
9) Chloromethane (CH3Cl)
10) Methanol (CH3OH)
11) Chloromethanal
12) Methyl Amine (CH3NH2)
13) Phosgene
14) Methyl Isocyanate
15) Methyl Formamide

Figure 3: Co-Material Design Results - Carbaryl Example


Note that species 8, 11, 13, 14 and 15 are included as additional molecules since none of these can be constructed according to the structural restrictions employed. Four further additional molecules were also included, as shown in Figure 4.

[Figure 4 (additional molecules): 16) Hydrogen (H2), 17) Oxygen (O2), 18) Water (H2O), 19) Hydrogen Chloride (HCl)]

Figure 4: Additional Molecules

MULTI-STEP STOICHIOMETRY IDENTIFICATION RESULTS

The solutions of the stoichiometry identification program are presented in the form of a table of stoichiometric coefficients in Table 3, where blank spaces indicate zero coefficients and the species are numbered as above. According to the industrial routes, stoichiometries of up to two steps in length were allowed, with a maximum of four species permitted in any step. The role specification and chemistry constraints employed in this example are given in Appendix A. No carbon structure constraints were employed in this example. A production rate lower bound of 2.5 kmol/hr and an allowable reactor temperature range of 300-800K were imposed.
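The two basic requirements on a candidate step (whole number coefficients that close the atom balance, and no more than four species) can be illustrated with a minimal sketch. It is not the optimisation formulation used in this work. The elemental formulas are standard, the sign convention (negative coefficients for reactants) is assumed, and the example coefficients are written from the well-known industrial chemistry rather than read from Table 3. The function name `is_valid_step` is hypothetical.

```python
from collections import Counter

# Elemental formulas, keyed by the species numbers of Figures 3 and 4.
FORMULAS = {
    3: {"C": 10, "H": 8, "O": 1},            # 1-naphthol
    6: {"C": 11, "H": 7, "Cl": 1, "O": 2},   # 1-naphthalenyl chloroformate
    7: {"C": 12, "H": 11, "N": 1, "O": 2},   # carbaryl
    12: {"C": 1, "H": 5, "N": 1},            # methyl amine
    13: {"C": 1, "O": 1, "Cl": 2},           # phosgene
    14: {"C": 2, "H": 3, "N": 1, "O": 1},    # methyl isocyanate
    19: {"H": 1, "Cl": 1},                   # hydrogen chloride
}

MAX_SPECIES_PER_STEP = 4  # limit quoted in the text


def is_valid_step(coeffs):
    """Check one step given as {species number: integer coefficient}.

    Assumed convention: negative coefficients are reactants,
    positive coefficients are products.
    """
    if not all(isinstance(nu, int) and nu != 0 for nu in coeffs.values()):
        return False
    if len(coeffs) > MAX_SPECIES_PER_STEP:
        return False
    balance = Counter()
    for species, nu in coeffs.items():
        for element, count in FORMULAS[species].items():
            balance[element] += nu * count
    # Every element must balance exactly.
    return all(total == 0 for total in balance.values())


# Methyl amine + phosgene -> methyl isocyanate + 2 HCl
mic_step = {12: -1, 13: -1, 14: 1, 19: 2}
# 1-naphthol + methyl isocyanate -> carbaryl
carbaryl_step = {3: -1, 14: -1, 7: 1}
print(is_valid_step(mic_step), is_valid_step(carbaryl_step))
```

A step that fails either test would never be passed on to the thermodynamic, economic and environmental evaluation.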

Table 3: Multistep Stoichiometries - Carbaryl Example

[The stoichiometric coefficient columns for species 1-19 are not legible in this copy of the table; the remaining columns are reproduced below, with n/a marking an entry that could not be read.]

Index   Nsp   Toper (K)   Production rate (kmol/hr)   Profit ($/mol)   CTAM (tn air/mol)

System 0 - Producing Species 7
A       3     300          9.99                        0.4508             19.57
B       4     300          5.51                       -2.9885             22.22
C       4     300         10.00                        0.5026             16.63
D       4     300          3.17                       -2.9485             16.08

System 1 - Producing Species 3
E       3     300         20.00                        0.5509              6.90
F       4     300          9.34                        0.5013             18.23
G       4     300         10.00                        0.5015             17.85
H       4     300         10.00                        0.5249             13.44

System 1 - Producing Species 14
I       4     800         10.00                        0.1887              5.68

System 1 - Producing Species 15
J       n/a   300          2.75                        0.0400              4.20
K       4     736          2.50                        0.0535           2050.78

System 1 - Producing Species 6
L       4     300         10.00                        3.9556             17.86
M       4     300         10.00                        3.4911             14.11
N       4     300         10.00                        3.5880             10.27

System zero produced four candidate stoichiometries that satisfy all constraints, in which materials 3, 6, 12, 14 and 15 appear as first generation precursor reactants. Systems 1A and 1B produced a total of ten further stoichiometries leading to all of these materials except species 12, which is allowed only as a reactant in systems 1A and 1B since it could only be produced by decomposing more complex naphthyl molecules. All stoichiometries except I and K achieve acceptable conversion at 300K.

For stoichiometry I, G_sys is minimised at 800K and high conversion since ΔG_R for this reaction has a large negative temperature gradient, so that equation 56 is dominated by its first RHS term. For stoichiometry K, however, reactor temperature has to be elevated to meet the production rate bound, so that this stoichiometry is the only one for which T' > 300K (from equations 57-59).

The two industrial chemistries shown in Figure 2 were reproduced: stoichiometries A and I represent the methyl isocyanate route, and stoichiometries D and N represent the non-methyl isocyanate route. Stoichiometries C and D represent the first and third of the three alternative single step routes put forward by Crabtree and El-Halwagi (1994); their second alternative does not appear here since it involves three apparently simultaneous reactants.

Table 4 shows the total profits and impacts for the individual solutions combined into multi-step stoichiometries. For example, the index AEI denotes the combination of steps A, E and I.
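The totals in Table 4 are simple sums over the constituent steps. The sketch below illustrates this for a few steps whose profit and CTAM values are transcribed from Table 3; the names `STEP_DATA` and `route_totals` are hypothetical, and the results agree with the corresponding Table 4 entries to within rounding of the last digit.

```python
# Step data transcribed from Table 3: (profit in $/mol, CTAM in tn air/mol).
STEP_DATA = {
    "B": (-2.9885, 22.22),
    "D": (-2.9485, 16.08),
    "K": (0.0535, 2050.78),
    "L": (3.9556, 17.86),
    "M": (3.4911, 14.11),
    "N": (3.5880, 10.27),
}


def route_totals(index):
    """Sum profit and CTAM over the steps named by the letters of `index`."""
    profit = sum(STEP_DATA[step][0] for step in index)
    ctam = sum(STEP_DATA[step][1] for step in index)
    return round(profit, 4), round(ctam, 2)


for index in ("BKL", "DL", "DN"):
    print(index, route_totals(index))
```

The very large CTAM of any route containing step K comes entirely from that single step, which is why such routes stand out on impact grounds.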

Note that the profits reflect only the values of the products minus the values of the reactants, assuming that stoichiometric co-products are sold at their market value, and that in this example, raw materials are assumed input waste free. Note also that stoichiometries with poor conversion are not penalised since the costs and impacts of separation are not included here, and it is assumed that unconsumed reactants are recycled with no loss of heat and no compression or pumping requirements.

Only stoichiometries involving step K can justifiably be eliminated from further consideration on impact grounds, this step being penalised in impact terms by high reactor temperature, and only stoichiometries involving steps M or N can justifiably be eliminated on economic grounds. Despite the fact that these steps both exhibit high profits, species 6 is such a high value material that only step L generates sufficient profit to cover the cost of consuming species 6 in system zero. It is for this reason that stoichiometry DL remains competitive despite the poor economic performance of step D, which was rejected by Crabtree and El-Halwagi (1994) on economic grounds. This clearly illustrates the advantages of considering multi-step production routes.

Table 4: Total Profits and Impacts

Index   Total Profit ($/mol)   Total CTAM (tn air/mol)
AEI     1.1403                   32.15
AFI     1.1408                   43.48
AGI     1.1409                   43.10
AHI     1.1644                   38.69
BJL     1.0070                   44.28
BJM     0.5426                   40.53
BJN     0.6394                   36.69
BKL     1.0205                 2090.86
BKM     0.5561                 2087.12
BKN     0.6529                 2083.27
CEJ     1.0453                   27.72
CEK     1.0570                 2074.31
CFJ     1.0439                   39.05
CFK     1.0574                 2085.64
CGJ     1.0441                   38.68
CGK     1.0576                 2085.26
CHJ     1.0675                   34.27
CHK     1.0810                 2080.85
DL      1.0070                   33.93
DM      0.5426                   30.19
DN      0.6394                   26.34

Of the remaining ten stoichiometries, the original industrial chemistry (steps A and I) with the addition of step E, F, G or H to produce species 3, exhibits the most promising economics of all, which is probably why it was selected. Furthermore, the environmental impacts of the routes based on this chemistry are also among the most promising. Of these routes, stoichiometry AEI represents the best compromise solution. Only stoichiometry CEJ exhibits a significantly lower impact than AEI, with only a marginally reduced profit, and so appears to represent the best compromise solution of all. However, conversion is poor for step J so that higher separation and recycle costs are anticipated. Issues such as this must be explored in order to eliminate further alternatives.


In the work presented here, a procedure for the rapid identification of alternative multi-step stoichiometries has been described in which each stoichiometric step involves whole number stoichiometric coefficients and a limited number of species. The key to the procedure is the introduction of material design principles to formalise the development of a set of co-materials from which stoichiometries are then extracted using an optimisation procedure.

The co-material enumeration procedure is based on a set of structural and chemical feasibility rules from Constantinou et al. (1996). However, rather than employing their molecular generation algorithm, the rules are instead used to develop a set of linear integer constraints governing the numbers and combinations of particular structural groups in a molecule. Combining these rules with the octet rule, and additional structural restrictions to limit the total number of groups, and the numbers of branches, substituents, substituted sites and functional groups, results in the co-material enumeration MILP formulation. This problem is solved repeatedly, introducing integer cuts after each iteration to exclude previous solutions, to produce a set of co-materials.
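The repeated "solve, then cut" loop can be illustrated with a minimal sketch. It is a stand-in under stated assumptions, not the MILP formulation itself: a brute-force search replaces the MILP solver, a list of previous solutions plays the role of the integer cuts, only the acyclic connectivity condition (the sum over groups of n_i(2 - v_i) must equal 2, where v_i is the number of free attachments) and a limit on the number of groups are enforced, and the chemical feasibility rules (for example those on chlorine) are omitted. All function names are hypothetical.

```python
from itertools import product

# Number of free attachments assumed for some of the acyclic groups in the text.
VALENCE = {"CH3": 1, "CH3NH": 1, "CHO": 1, "OH": 1, "Cl": 1, "COO": 2}
GROUPS = list(VALENCE)


def is_feasible(counts, max_groups):
    """Size limit plus the acyclic connectivity (octet-type) condition."""
    total = sum(counts)
    if total < 2 or total > max_groups:
        return False
    return sum(n * (2 - VALENCE[g]) for g, n in zip(GROUPS, counts)) == 2


def next_solution(max_groups, cuts):
    """Stand-in for one MILP solve: first feasible vector not yet cut off."""
    for counts in product(range(max_groups + 1), repeat=len(GROUPS)):
        if counts not in cuts and is_feasible(counts, max_groups):
            return counts
    return None


def enumerate_co_materials(max_groups=2):
    cuts = []  # previously found solutions, acting as integer cuts
    while (solution := next_solution(max_groups, cuts)) is not None:
        cuts.append(solution)
    return [{g: n for g, n in zip(GROUPS, c) if n} for c in cuts]


molecules = enumerate_co_materials(max_groups=2)
print(len(molecules), molecules[:3])
```

With a limit of two groups, only pairs of single-attachment groups survive, among them CH3-Cl (chloromethane) and CH3-OH (methanol). The sketch also returns pairs such as Cl-Cl, which the chlorine constraints described earlier would exclude.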

Stoichiometries are then extracted from this set of materials using an optimisation procedure, in which stoichiometries are explicitly enumerated and subsequently evaluated in two sequential steps. Stoichiometry enumeration includes whole number stoichiometric coefficients constraints, constraints to restrict changes to the carbon skeletons of the reacting species, and case specific constraints based on chemical knowledge. Thermodynamic, economic and environmental impact criteria are employed in the evaluation of the stoichiometries, with aspects of the MEIM (Pistikopoulos et al., 1994) providing the framework for the environmental evaluation of alternatives.

The illustrative example has shown that the co-material design technique provides an interesting set of co-material molecules and that, with the inclusion of a few simple rules based on chemical knowledge, it is possible to limit the quantity of co-materials to a manageable number. Furthermore, by incorporating simple chemical rules along with thermodynamic, economic and environmental criteria in stoichiometry identification it is possible to identify a small number of alternative stoichiometries which are promising both in terms of economics and environmental impact. Moreover, it has been shown that developing multi-step stoichiometries directly can lead to the acceptance of alternatives which would be rejected as single step syntheses.

In the illustrative example, existing industrial chemistries were identified as the most promising compromise solutions, with several new and competitive alternatives. This suggests that the approach could lead to promising results in the search for production routes for new molecules. Further reinforcement of this conclusion appears in a second application presented in Chapter 14.


ACGIH. Airborne Hazards at Work. American Conference of Governmental Industrial Hygienists. Great Britain Factory Inspectorate. London (1977)
Agnihotri, R.B. and R.L. Motard. Reaction Path Synthesis in Industrial Chemistry. Computer Applications to Chemical Engineering, ACS Symposium Series 124, 193-206 (1980)
Androulakis, I.P. Kinetic mechanism reduction based on an integer programming approach. AIChE Journal 46(2), 361-371 (2000)
Buxton, A. Solvent Blend and Reaction Route Design for Environmental Impact Minimisation. PhD Thesis. Imperial College, London (2002)
Buxton, A., A.G. Livingston and E.N. Pistikopoulos. Reaction Path Synthesis for Environmental Impact Minimization. Computers chem. Engng. 21, S959-S964 (1997)
Chemical Prices. Chemical Marketing Reporter 253(8), 25-35 (1998)
Compounds. Dictionary of Organic Compounds. Chapman & Hall, London, 6th Edition (1996)
Constantinou, L., C. Jaksland, K. Bagherpour and R. Gani. Application of the Group Contribution Approach to Tackle Environmentally Related Problems. AIChE Symposium Series, Volume on Pollution Prevention through Process and Product Modifications 303, 105-116 (1994)
Constantinou, L., K. Bagherpour, R. Gani, J.A. Klein and D.T. Wu. Computer Aided Product Design: Problem Formulations, Methodology and Applications. Computers chem. Engng. 20(6), 685-703 (1996)
Corey, E.J. and W.T. Wipke. Computer Assisted Design of Complex Organic Syntheses. Science 166 (1969)
Corey, E.J., W.T. Wipke, R.D. Cramer III and W.J. Howe. Techniques for Perception by a Computer of Synthetically Significant Structural Features in Complex Molecules. J. Am. Chem. Soc. 94, 431 (1972)
Corey, E.J., H.W. Orf and D.A. Pensak. Computer Assisted Synthetic Analysis. The Identification and Protection of Interfering Functionality in Machine-Generated Synthetic Intermediates. J. Am. Chem. Soc. 98, 210 (1976)
Crabtree, E.W. and M.M. El-Halwagi. Synthesis of Environmentally Acceptable Reactions. AIChE Symposium Series, Volume on Pollution Prevention via Process and Product Modifications 90, 117-127 (1994)
Dagani, R. Data on MIC's Toxicity are scant, leave much to be learned. Chemical & Engineering News 63(6), 37-40 (1985)
Derringer, G.C. and R.L. Markham. A Computer-Based Methodology for Matching Polymer Structures with Required Properties. Journal of Applied Polymer Science 30, 4609 (1985)
Edwards, K., T.F. Edgar and V.I. Manousiouthakis. Reaction mechanism simplification using mixed-integer nonlinear programming. Computers chem. Engng. 24, 67-79 (2000)
Fornari, T. and G. Stephanopoulos. Synthesis of Chemical Reaction Paths: The Scope of Group Contribution Methods. Chemical Engineering Communications 129, 135-157 (1994a)
Fornari, T. and G. Stephanopoulos. Synthesis of Chemical Reaction Paths: Economic and Specification Constraints. Chemical Engineering Communications 129, 159-182 (1994b)
Fornari, T., E. Rotstein and G. Stephanopoulos. Studies On the Synthesis of Chemical Reaction Paths - II. Reaction Schemes with Two Degrees of Freedom. Chemical Engineering Science 44(7), 1569-1579 (1989)
Gani, R., B. Nielsen and A. Fredenslund. A Group Contribution Approach to Computer-Aided Molecular Design. AIChE J. 37, 1318-1332 (1991)
Gao, C., R. Govind and H. Tabak. Application of the Group Contribution Method for Predicting the Toxicity of Organic Chemicals. Environmental Toxicology and Chemistry 11, 631-636 (1992)
Gelernter, H., N.S. Sridharan, A.J. Hart, F.W. Fowler and H.J. Shue. An Application of Artificial Intelligence to the Problem of Organic Synthesis Discovery. Topics Curr. Chem. 41, 113 (1973)
Glover, F. Improved Linear Integer Programming Formulations of Nonlinear Integer Problems. Management Science 22(4), 455-460 (1975)
Govind, R. and G.J. Powers. Studies in Reaction Path Synthesis. AIChE J. 27(3), 429-442 (1981)
Heijungs, R., J.B. Guinee, G. Huppes, R.M. Lankreijer, H.A. Udo de Haes, A. Wegener Sleeswijk, A.M.M. Ansems, A.M.M. Eggels, R. van Duin, H.P. de Goede. Environmental Life Cycle Assessment of Products: Background and Guide. Multicopy. Leiden (1992)
Hendrickson, J.B. A General Protocol for Systematic Synthesis Design. Topics in Curr. Chem. 62, 49 (1971)
Holiastos, K. and V. Manousiouthakis. Automatic Synthesis of Thermodynamically Feasible Reaction Clusters. AIChE J. 44(1), 164-173 (1998)
ISO 14040. Environmental Management - Life Cycle Assessment - Part 1: Principles and Framework. (1997)
Joback, K.G. Unified Approach to Physical Property Estimation using Multivariate Statistical Techniques. Master's thesis. MIT, Cambridge, Massachusetts (1984)
Joback, K.G. Designing Molecules Possessing Desired Physical Property Values. PhD thesis. MIT, Cambridge, Massachusetts (1989)
Joback, K.G. and G. Stephanopoulos. Designing Molecules Possessing Desired Physical Property Values. Proceedings FOCAPD, CACHE Corporation, Austin, Texas 11, 631-636 (1989)
Joback, K.G. and G. Stephanopoulos. Designing Molecules Possessing Desired Physical Property Values. Advances in Chemical Engineering 21 - Intelligent Systems in Process Engineering, Academic Press (1995)
Kalelkar, A.S. Investigation of Large Magnitude Accidents: Bhopal as a Case Study. Arthur D. Little Inc., Cambridge, Massachusetts (1988)
Kaufmann, G. Computer Design of Synthesis in Organo-Phosphorous Chemistry. Computer-Assisted Design of Organic Synthesis, Table Ronde Roussel UCLAF, Paris (1977)
Knight, J.P. Computer-Aided Tools to Link Chemistry and Design in Process Development. PhD Thesis, Massachusetts Institute of Technology (1995)
Mavrovouniotis, M.L. and D. Bonvin. Design of Reaction Paths. FOCAPD, AIChE Symposium Series 91, 41-51 (1995)
May, D. and D.F. Rudd. Development of Solvay Clusters of Chemical Reactions. Chem. Eng. Sci. 31, 59 (1976)
Perry, R.H. and D. Green. Perry's Chemical Engineers' Handbook. 6th ed. McGraw Hill (1984)
Pistikopoulos, E.N., S.K. Stefanis and A.G. Livingston. A Methodology for Minimum Environmental Impact Analysis. AIChE Symposium Series, Volume on Pollution Prevention through Process and Product Modifications 90(303), 139-150 (1994)
Porter, K.E., S. Sitthiosoth and J.D. Jenkins. Designing a Solvent for Gas Absorption. Trans IChemE 69(A), 229-236 (1991)
Rotstein, E., D. Resasco and G. Stephanopoulos. Studies on the Synthesis of Chemical Reaction Paths - I. Chemical Engineering Science 37(9), 1337-1352 (1982)
SETAC. A Conceptual Framework for Life-Cycle Impact Assessment. (1993)
Shrivastava, P. Bhopal, Anatomy of a Crisis. Ballinger Publishing Company, Cambridge, Massachusetts (1987)
Sirdeshpande, A.R., M.G. Ierapetritou and I.P. Androulakis. Design of flexible reduced kinetic mechanisms. AIChE Journal 47(11), 2461-2473 (2001)
Stefanis, S.K. A Process Systems Methodology for Environmental Impact Minimization. PhD Thesis. Imperial College, London (1996)
Stefanis, S.K. and E.N. Pistikopoulos. A Methodology for Environmental Risk Assessment for Industrial Non-Routine Releases. Ind. Eng. Chem. Res. 36, 3694-3707 (1997)
Ugi, I. and P. Gillespie. Chemistry and Logical Structure. 3. Representation of Chemical Systems and Interconversions by BE matrices and their Transformation Properties. Angew. Chem. Int. Ed. Engl. 10, 914 (1971)
Van Krevelen, D.W. and H.A.G. Chermin. Estimation of the Free Enthalpy (Gibbs Free Energy) of Formation of Organic Compounds from Group Contributions. Chem. Eng. Sci. 1, 66-80 (1951)
Weissermel, K. and H.-J. Arpe. Industrial Organic Chemistry. Second, Revised and Extended Edition. VCH, Weinheim FRG (1993)
Wipke, W.T., H. Braun, G. Smith, F. Choplin and W. Seiber. SECS Simulation and Evaluation of Chemical Synthesis: Strategy and Planning. In: Computer-Assisted Organic Synthesis (W.T. Wipke and W.J. Howe, Eds.) ACS Symposium Series 61 (1976)
Worthy, W. Methyl Isocyanate: The Chemistry of a Hazard. Chemical and Engineering News 63(6), 27-33 (1985)


Appendix A: Role Specification and Chemistry Constraints for Case Study 1 - Manufacture of Carbaryl

A.1 Role Specification Constraints

Table 5 shows the knowledge based role specification constraints employed in the carbaryl example, where R denotes reactant only, P denotes the final product, C denotes product or co-product, N denotes the exclusion of a species from a system and a blank space denotes no restriction.

Table 5: Role Specification Constraints - Carbaryl Example

System 0

1A & 1B

Species 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 R R R R R R P R R R C N C

C C C C R C C C R

These constraints were developed specifically for two step stoichiometries according to the following arguments, based on chemical knowledge and the existing industrial chemistries:

• The product (carbaryl, species 7) should appear only as the product in system zero, or as a product or co-product in systems 1A and 1B, never as a reactant.

• Other naphthyl group containing molecules should be reactants only in system zero (i.e. no naphthyl containing co-products should appear in system zero), and naphthyl containing compounds with complex substituents (i.e. species 4, 5, and 6) should be products or co-products only in systems 1A and 1B.

• Methyl isocyanate and methyl formamide (species 14 and 15) may be produced only in systems 1A and 1B and consumed only in system zero (i.e. they may not be decomposed to produce simpler molecules).

• H2 (species 16) appears as a co-product only in all systems since hydrogenation reactions are not required.

• HCl (species 19) appears as a co-product only in system zero, and as a reactant (Cl provider) or co-product (as in the industrial chemistries) in systems 1A and 1B.

• H2O (species 18) appears as a possible OH group donor or recipient in all systems.

• O2 (species 17) appears as an oxygen provider in systems 1A and 1B only.

A.2 Chemistry Constraints

The following knowledge based chemistry constraints were employed:

• naphthyl containing species with complex substitutions are not allowed as reactants

ii_4 + ii_5 + ii_6 = 0 (73)

• other naphthyl group containing species may not react with each other

ii_1 + ii_2 + ii_3 + ii_7 ≤ 1 (74)
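Constraints 73 and 74 can be read as simple tests on the set of reactants of a candidate step. The sketch below is one such reading, not the formulation itself: it assumes that the variables in the two equations are binary indicators of a species acting as a reactant, that negative coefficients denote reactants, and that the check is applied to a step in systems 1A and 1B (species 6 is consumed in system zero). The function name is hypothetical.

```python
COMPLEX_NAPHTHYLS = {4, 5, 6}    # species appearing in equation 73
OTHER_NAPHTHYLS = {1, 2, 3, 7}   # species appearing in equation 74


def satisfies_chemistry_constraints(coeffs):
    """coeffs maps species number -> coefficient (negative = reactant, assumed)."""
    reactants = {species for species, nu in coeffs.items() if nu < 0}
    # Equation 73: none of species 4, 5, 6 may be a reactant.
    if reactants & COMPLEX_NAPHTHYLS:
        return False
    # Equation 74: at most one of species 1, 2, 3, 7 may be a reactant.
    return len(reactants & OTHER_NAPHTHYLS) <= 1


# 1-naphthol + phosgene -> 1-naphthalenyl chloroformate + HCl
print(satisfies_chemistry_constraints({3: -1, 13: -1, 6: 1, 19: 1}))
```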

PART II: Applications of CAMD

The first part of the book dealt with some of the solution techniques commonly employed to tackle the CAMD problem. This part demonstrates the application and practice of some of those techniques to different types of problems in CAMD.

Chapters 8 & 9 describe the industrial application of CAMD methods for solvent design and selection. In particular, the use of the hybrid CAMD method (chapter 6) is highlighted. Chapter 10 deals with optimal solvent design for blanket-wash using the optimization-based CAMD method of chapter 3. Chapter 11 extends the application of CAMD from single solvent design to solvent mixture design together with an application example. Chapter 12 provides an example of the application of the global optimization-based CAMD method of chapter 4 to optimal refrigerant design while chapter 13 highlights the application of genetic algorithm-based CAMD (chapter 5) to polymer design. Chapter 14 provides a detailed case study of the application of CAMD to identify multistep reaction stoichiometries (using the method described in chapter 7). Finally, chapter 15 presents the application of CAMD to design of fuel additives employing the genetic-algorithm based CAMD method of chapter 5.


Chapter 8: CAMD for Solvent Selection in Industry - I

J. M. Vinson

8.1 INTRODUCTION

While the process design for a new commercial drug is not the critical step of getting a new drug to market, it is important to do a good job of scaling from the laboratory to full production. A good pharmaceutical manufacturing process is insensitive to small variations in operating conditions and runs in a reasonable time. Along with the commercial manufacturing needs, batches of active pharmaceutical ingredient (API) are required for clinical trials, so processes in development must be capable of delivering ever-increasing amounts of API.

One aspect of the development of production processes for APIs is the selection of appropriate solvents for dissolution of raw materials, reactions and product crystallization. The most common mechanism for determining solvents is to select from a common set of solvents, as pharmaceutical companies are not interested in developing new solvents for their processes.

While this method is sufficient for most cases, situations arise where none of the usual solvents are terribly effective. When the timelines are short, the developers have little choice but to go with the best of the ineffective solvents, or to use a more complex procedure. For example, the procedure might call for several extractions and back-extractions to achieve an acceptable yield with a moderate solvent, whereas a better solvent might be available that can effect the extraction in one or two steps.

CAMD is one tool that can help quickly point to a number of candidate solvents. The goal is to help the development team find candidate solvents that they may not have considered in the normal course of development.

As we have seen throughout this book, computer aided molecular design is a methodology in which molecules are designed to meet specific needs. While the approaches vary widely, depending on the application area, they all require the ability to predict the behavior of the full compound. This is accomplished through molecular dynamics, expert systems, genetic algorithms, group contribution methods, and combinations of these techniques.

The hitch for complex industrial compounds, such as those found in pharmaceutical applications, is that it is not always possible to accurately predict the properties of the compounds. This chapter will describe the application of computer-aided molecular design to situations where the standard CAMD techniques do not obviously apply.

8.2 CAMD METHODOLOGY USED

8.2.1 Generalized CAMD Framework

As described in chapter 6, the basic approach to computer-aided molecular design (Harper, 2000) is a three-step process:

1. Pre-Design: Define the problem in terms of desired properties of the compound to be designed. At this stage it is also critical that one select the best formulation of the problem, as the problem with the most clarity and the most available data will be easier to solve. Since design is an iterative process, it is not unlikely that one will come back to the pre-design stage to evolve the problem definition, based on results obtained in stages two and three.

2. CAMD: Run the actual CAMD algorithm to generate compounds and test them against stated criteria from the pre-design stage.

3. Post-Design: Test the results based on properties that are not easily screened during stage two, such as environmental, health & safety criteria. The compounds must also be tested either in simulation or in the laboratory.

8.2.2 Special Features for Complex Solutes

By its very nature, specialty chemical and pharmaceutical development involves working with new compounds. Many of these contain unusual active groups or combinations of active groups that make property predictions dubious by standard means. These factors add an extra degree of difficulty to solvent selection problems.

However, one can make use of extensive experimental apparatus to enable computer-aided molecular design. In particular, where traditional methods of solvent selection by experiment do not result in an acceptable solvent, the results of those very experiments can help point researchers in the right direction. The following seven-step procedure (Vinson et al. 2000) details how to combine experimental work with CAMD to help find appropriate solvents for complex solutes.

Step 1. Select N solvents with solubility parameter values between those of hexane (minimum) and water (maximum).

Step 2. Compute solubility of the solute in each of the N solvents using the regular solution theory (requires only solubility parameter value).

Step 3. Plot the calculated solubility values against the known solubility parameter values of each solvent and identify the location of the maximum solubility value together with the corresponding solubility parameter value.

Step 4. Based on the solubility parameter value from step 3, identify compounds with similar solubility parameter values from the database to produce a list of compounds with known properties such as melting points, boiling points, and so on.

Step 5. Use the generated data to define the solvent design/selection problem and go to stage 2 of the generalized CAMD framework (see chapter 6).

Step 6. Validate the selected compounds (solvents) from step 5 by plotting their ability to dissolve the solute on the solubility versus solubility parameter plot. They should all lie near the maximum.

Step 7. Consider other properties as given in the post design phase for further screening and final selection.

Note that the first two steps of this procedure are traditionally conducted in the lab in the search for appropriate solvents. The remaining steps walk researchers through a mechanism to develop a CAMD formulation to find alternate solvents.
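Steps 2 and 3 can be sketched numerically as follows. The sketch uses the standard regular solution expression for the mole-fraction solubility of a solid, ln x = -(ΔH_fus/R)(1/T - 1/T_m) - V(δ_solvent - δ_solute)^2/(RT), under the assumption of a dilute solution (solvent volume fraction close to one). The solvent solubility parameters are approximate literature values, the solute properties are hypothetical and chosen only for illustration, and all names are assumptions of this sketch.

```python
import math

R = 8.314  # gas constant, J/(mol K)


def regular_solution_solubility(delta_solvent, delta_solute, v_solute,
                                dh_fus, t_melt, temp=298.15):
    """Mole-fraction solubility of a solid solute from regular solution theory.

    Solubility parameters in MPa^0.5, v_solute in cm^3/mol (so that
    v * delta^2 is in J/mol), dh_fus in J/mol, temperatures in K.
    """
    ideal = -(dh_fus / R) * (1.0 / temp - 1.0 / t_melt)
    excess = v_solute * (delta_solvent - delta_solute) ** 2 / (R * temp)
    return math.exp(ideal - excess)


# Step 1: solvents spanning hexane to water (approximate values, MPa^0.5).
SOLVENTS = {"hexane": 14.9, "toluene": 18.2, "acetone": 20.0,
            "ethanol": 26.5, "methanol": 29.6, "water": 47.8}

# Hypothetical solute properties, for illustration only.
SOLUTE = dict(delta_solute=22.0, v_solute=150.0, dh_fus=25000.0, t_melt=450.0)

# Step 2: solubility in each solvent.
solubility = {name: regular_solution_solubility(delta, **SOLUTE)
              for name, delta in SOLVENTS.items()}

# Step 3: locate the maximum and the corresponding solubility parameter.
best = max(solubility, key=solubility.get)
print(best, SOLVENTS[best])
```

In this model the calculated solubility peaks for the solvent whose solubility parameter lies closest to that of the solute, which is the value carried forward into step 4.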

8.3 APPLICATION EXAMPLES

With the case studies, we explore the use of CAMD for solvent selection in a number of examples inspired by the pharmaceutical industry. Only the first example is worked in full detail. The second example provides the basic setup of the problem and suggestions as to how one might approach finding appropriate solvents via CAMD. The final example is a challenge problem, which is presented to show the full complexity of solving computer aided molecular design in a real-world situation. The ProCamd (ICAS Documentation, 2002) software developed at CAPEC (www.capec.kt.dtu.dk) has been used for the solution of the problems.

8.3.1 Example 1: Extraction Solvent Replacement

This case study is an example of CAMD used for solvent selection. Not only does this example show the difficulty of handling complex compounds, but it also demonstrates the need for well-thought-out problem formulation. In this case, there are several problems to be handled by a new solvent. The first task is to determine which problem has the highest likelihood of successful resolution.

This example is of a reaction system, followed by extraction. The basic chemistry is described in Figure 1. The first reaction is a peptide coupling between compounds A and B with diisopropyl carbodiimide (DIC) as a coupling agent and N-hydroxybenzotriazole as a catalyst. This liquid-phase reaction runs in a solvent mixture of 1:1 dimethylformamide (DMF) and methylene chloride (MeCl). Reactant A has limited solubility in these solvents, thus the reaction runs over several hours. The second reaction is a saponification that hydrolyzes the ethyl group in compound C with 2.5 N sodium hydroxide. The current process calls for no isolation between the first and second reaction, which is common in the pharmaceutical industry due to purity concerns. The second reaction is followed by an extraction in methylene chloride, leaving the product in the aqueous layer. The final workup (not shown in Figure 1) involves an isoelectric precipitation that isolates the product as a zwitterion. The diisopropyl urea (DIU) byproduct of reaction 1 is somewhat soluble in water, which leaves DIU with the product throughout the precipitation step and necessitates additional back-extractions after the second reaction to purify the aqueous phase.

[Figure 1 (reaction scheme). Reaction 1: A (solid) + B (solid, HCl salt) + diisopropyl carbodiimide -> C (dissolved) + diisopropyl urea (dissolved); conditions: 1) 1:1 DMF/CH2Cl2, 2) 0.2 equiv. N-hydroxybenzotriazole, 3) ambient temp., 6 h. Reaction 2: C (dissolved) + NaOH -> D (dissolved) + CH3CH2OH]

Figure 1: Reaction chemistry details for example 1

While scaling this process from the laboratory to the manufacturing scale, several inefficiencies in this process became evident, and the development team began looking into alternatives. As far as CAMD goes, there are a number of problems to solve, several of which interact with one another. One could attempt to find a replacement solvent for methylene chloride, as it is environmentally undesirable for large-scale operations. A solvent that preferentially removes the DIU impurity from the first reaction mixture would reduce the need for additional workup after the second reaction. Finally, the reaction time could be improved by finding a new solvent (or solvent system) that does a better job of dissolving compounds A and B.

Pre-Design Phase

After studying the available data and the compounds in question, it became clear that the best problem to solve is that of removing the DIU impurity from the post-second-reaction mixture. While it would be more profitable to explore a better reaction system solvent, there is not enough solubility data on the compounds A, B, C and D to attempt solvent design for them. In addition, these compounds have structures that reduce the utility of the group contribution methods. As we shall see in the results, this particular approach can also help replace or reduce the methylene chloride in the process. The options and some comments on their viability are listed in Table 1.

Table 1: Summary of potential problems to solve in Example 1

Option: Replace methylene chloride
Comment: Wide range of possible resolutions, may take too long. Not enough experimental data for this system.

Option: Remove DIU after the first reaction
Comment: Reduce the number of extractions, but this may be a large process change. DIU solubility data are available.

Option: Remove DIU after the second reaction
Comment: The smallest change, as this is the current process. A better solvent system must be found. DIU solubility data are available.

Option: Find a better solvent for the reactants, A and B
Comment: Requires significantly more solubility information than is available for the reactants and products. Compounds are not compatible with current group contribution estimation methods.

In this example, we apply the solvent selection approach for complex compounds, described in Section 8.2.2. Steps one and two had been completed in earlier experimental work. DIU solubility was determined in a number of solvents, which happen to span a wide range of total solubility parameters. In selecting solvents, one wants to ensure a good representation of the range of total solubility parameter in order to focus in on the most likely range of total solubility parameter for the designed solvent.

Step three of the method produces a plot of solubility data for DIU as a function of total solubility parameter for the solvents. This is shown in Figure 2. Since the peak in Figure 2 is at a total solubility parameter of around 22 MPa^0.5, it is most likely that the best solvents for DIU will have a total solubility parameter between 21 and 23 MPa^0.5. It is also clear that the solute is not very soluble in either paraffinic or cyclic hydrocarbons (solubility parameter around 15 MPa^0.5) and only slightly so in water (solubility parameter of 47.8 MPa^0.5) and other polar solvents. The likely solvents having similar solubility parameters are acyclic alcohols, ketones, aldehydes, acids, esters and ethers. Aromatic solvents, though they may fit into this total solubility parameter range, will not be considered due to their poor health effects profiles.

[Plot: solubility of diisopropylurea (% w/w) against solvent solubility parameter (MPa^0.5)]

Figure 2: Plot of solubility versus total solubility parameters for DIU as the solute

In Step four the other properties of the solvent are identified, either by comparison with database materials or by specification of the process. In this case, the process creates the specifications. Since the solvent must be a liquid at the operating conditions, its melting point must be less than 300 K. Since DIU is to be removed from the reaction mixture, the solvent should split the reaction mixture into two phases with the solvent rich phase containing the majority of the DIU, which is not totally miscible in water. As a first pass environmental assessment, the Octanol/Water partition coefficient should be kept as low as possible. Also, for recovery concerns, the solvent must be easy to separate from DIU and therefore, its boiling point should not be greater than 450 K for separation by evaporation or distillation. These target properties are listed in Table 2.


Table 2: Target properties for example 1

Property | Target / Range
Total solubility parameter | 21 - 23 MPa^0.5
Boiling point | Less than 450 K
Melting point | Less than 300 K
Octanol/Water partition coefficient (logP) | Less than 2
Water solubility | Immiscible
Water capacity of solvent | Less than 1.0, preferably zero
Groups to search | CH3, CH2, CH, C, OH, CH3CO, CH2CO, CHO, CH3COO, CH2COO, HCOO, CH3O, CH2O, CH-O, COOH, COO

Also required: low environmental impact; limited health & safety concerns.

CAMD Phase

Moving into step five of the process described in Section 8.2.2, new compounds were generated by ProCAMD based on the specifications listed in Table 2. This problem formulation generated 3498 compounds, based on combining groups to form only chemically feasible molecules. The octanol/water constraint removed 260 compounds. The total solubility parameter constraint removed 2634 compounds. The melting and boiling point constraints removed 534 and 39 compounds, respectively. The solvent capacity constraint removed another 17 compounds, leaving 14 final compounds. Of these, about half appear in the DIPPR database of compounds. Table 3 lists those designed solvents that appear in the database, together with their water solubility and EH&S properties. Note that 2-butanol and 2-methyl-1-propanol are isomers in terms of the groups that make up the compounds (2 CH3, CH2, CH, OH). The final four compounds (five-carbon alcohols) are also isomers with respect to their groups (2 CH3, 2 CH2, CH, OH). Note that the predicted and actual total solubility parameters do not necessarily match.
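The bookkeeping of this screening sequence can be replayed with a few lines of arithmetic. The sketch below is not ProCAMD code; it only applies the removal counts quoted in the paragraph above in the stated order, and the helper name `run_cascade` is a hypothetical one introduced for illustration.

```python
# Replay of the Example 1 screening counts quoted in the text.
# run_cascade is a hypothetical helper, not part of ProCAMD or ICAS.

generated = 3498  # compounds generated from the selected groups

# (constraint, number of compounds removed), in the order given in the text
removed_by_constraint = [
    ("octanol/water partition coefficient", 260),
    ("total solubility parameter", 2634),
    ("melting point", 534),
    ("boiling point", 39),
    ("solvent capacity", 17),
]


def run_cascade(start, steps):
    """Return (constraint, removed, remaining) after each filter, in order."""
    remaining = start
    history = []
    for name, removed in steps:
        remaining -= removed
        history.append((name, removed, remaining))
    return history


history = run_cascade(generated, removed_by_constraint)
for name, removed, remaining in history:
    print(f"{name:40s} removed {removed:5d}, {remaining:5d} left")

final_count = history[-1][2]
```

Running the sketch reproduces the 14 final compounds reported above, which is a quick consistency check on the quoted counts.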


Table 3: Potential solvents for Example 1

Compound (CAS #) | Predicted Solubility Param. | Solubility Param. (1) | Water Solubility (2) (mg/L) | RTECS Code (3)
1-Butanol (000071-36-3) | 22.47 | 23.3536 | 6.32E+004 |
t-Butanol (000075-65-0) | 21.47 | 21.6034 | 1E+006 |
 | | | | Mutagen (M); Reproductive-Effector (T); Human-Data (P); Primary-Irritant (S)
2-Methyl-1-propanol (000078-83-1) | 21.83 | 22.9094 | 8.5E+004 | Tumorigen (C); Mutagen (M)
2-Butanol (000078-92-2) | 21.83 | 22.5414 | 1.81E+005 | Reproductive-Effector (T)
1-Pentanol (000071-41-0) | 21.72 | 22.576 | 2.2E+004 |
Ethylene glycol monopropyl ether (002807-30-9) | 21.65 | 20.0055 | 3.169E+005 |
 | | | | Mutagen (M); Primary-Irritant (S)
Ethyl lactate (000097-64-3) | 22.62 | 22.3818 | 1E+006 |
2-Methyl-1-butanol (000137-32-6) | 21.16 | 22.1107 | 2.97E+004 | Primary-Irritant (S)
3-Methyl-1-butanol; isoamyl alcohol (000123-51-3) | 21.16 | 22.1574 | 2.67E+004 | Tumorigen (C); Human-Data (P); Primary-Irritant (S)
2-Pentanol (006032-29-7) | 21.16 | 21.704 | 4.46E+004 |
3-Pentanol (000584-02-1) | 21.16 | 21.1227 | 5.15E+004 |

To explore the sensitivity of the CAMD method, we can adjust each of these constraints individually through ProCAMD. For example, tightening the logP constraint to less than 1.0 will filter over 1000 compounds and change the results of the subsequent filters as well, leaving ten compounds in the end. This information is listed in Table 4, showing the change and the number of compounds screened out for each condition along with the number of compounds remaining after screening. In a quick assessment, the most sensitive parameter for the overall problem is the boiling point range, as raising the upper bound increases the number of candidates to 96 from four.

1 Solubility parameter data are from the CAPEC Database (www.capec.kt.dtu.dk).
2 Water solubility data are from "SRC PhysProp Database" (online version), Syracuse Research Corporation, Syracuse, NY, USA.
3 RTECS codes are from the WebSpirs version of "RTECS (through 2000/04)".


Table 4: Number of compounds screened out for a variety of conditions

Change | logP | Sol. Par. | Tm | Tb | Water Capacity | Final Compounds
Original constraints | 260 | 2634 | 181 | 392 | 17 | 14
logP to less than 1 | 1031 | 1991 | 426 | 23 | 17 | 10
Water capacity max 0.5 | 260 | 2634 | 181 | 392 | 28 | 3
Melting point max 250 K | 260 | 2634 | 534 | 39 | 17 | 14
Melting point max 200 K | 260 | 2634 | 598 | 0 | 2 | 4
Boiling point max 500 K | 260 | 2634 | 181 | 282 | 45 | 96
Boiling point max 350 K | 260 | 2634 | 181 | 463 | 0 | 0

Post-Design Phase

Steps 6 and 7 of the design procedure for complex solutes call for additional analysis of the candidate solvents found from the CAMD phase. Ideally, we would test the apparently best compounds as solvents for DIU. In this case, of the solvents listed in Table 3, 1-butanol and 2-methyl-1-propanol were easily available from our stockroom. A quick test of DIU solubility showed excellent results for both the butanol (6.25 wt% at 25 °C) and the methyl propanol (6.48 wt% at 25 °C). These values were 50% higher than the best solvent tested in our prior work, as shown in Figure 2.

We then took a new look at the reactions above and decided to work with butanol as an extraction solvent after the second reaction. We examined two possibilities. The first option was to keep the DMF/MeCl2 mixture for the first reaction and conduct one methylene chloride extraction after reaction two, followed by a butanol extraction. As this option was not substantially different from the standard chemistry, there were no significant improvements in the yield or purity of the mixture after the second reaction.

The second option was to remove methylene chloride from the chemistry entirely, as it does not participate in the reaction, and use butanol as the only extraction solvent after reaction two. For comparison purposes the reaction mixture after the second reaction was divided into two portions. One portion was extracted with two methylene chloride treatments per the standard. The other portion was extracted with an equivalent volume of butanol in two extractions. The results for these experiments are listed in Table 5.


Table 5: Example 1 Results

Content of aqueous layer (percent of original) | MeCl2 Extraction | Butanol Extraction
DIU total | 0% | 0%
D (extraction 1) | 98% | 74%
D (extraction 2) | 94% | 99%
D total | 92% | 73%
DMF | 20% | 11%
HOBT | 76% | 8%
D-urea impurity | ~95% | ~20%

Overall both solvents remove all the DIU after two extractions, which is the primary goal of the extractions. The butanol system is at a slight disadvantage for product recovery due to the first extraction pulling about 26% of the product, D, into the butanol layer. Clearly, this extraction could be optimized to achieve greater than 90% total recovery of product from the extraction step. The other advantage of the butanol extractions is that the content of both the DMF and HOBT in the aqueous layer is much lower. Finally, the butanol extractions were successful in significantly reducing the level of the D-urea reaction byproduct. While this was not stated as a primary goal, reducing the amount of extraneous organics in the aqueous phase is advantageous to the subsequent precipitation reaction and purification steps. The all-butanol extraction option has the additional advantage of removing methylene chloride entirely from the process.
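The product figures discussed above follow from simple multiplication of the per-extraction fractions. The sketch below assumes the reading of Table 5 in which product D retained in the aqueous layer is 98% then 94% for the methylene chloride extractions and 74% then 99% for the butanol extractions; the helper name `total_retained` is hypothetical.

```python
# Fraction of product D left in the aqueous layer after each extraction,
# as read from Table 5 (assumed row assignment; see lead-in).
retained_per_extraction = {
    "MeCl2": [0.98, 0.94],
    "butanol": [0.74, 0.99],
}


def total_retained(fractions):
    """Overall fraction retained after a sequence of extractions."""
    total = 1.0
    for fraction in fractions:
        total *= fraction
    return total


totals = {
    solvent: total_retained(fractions)
    for solvent, fractions in retained_per_extraction.items()
}

for solvent, total in totals.items():
    loss_first = 1.0 - retained_per_extraction[solvent][0]
    print(f"{solvent}: D total {total:.0%}, lost in first extraction {loss_first:.0%}")
```

The products round to the 92% and 73% "D total" entries, and the butanol first-extraction loss is the 26% quoted in the text, so nearly all of the product loss in the butanol option comes from the first extraction.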

8.3.2 Example 2: Mass separating agent

In this example, the development team was exploring the design of a manufacturing process to recover pure product and potentially recover solvents for reuse. The method used to design the process was based on the process synthesis procedures of Jaksland (1995). The synthesis procedure explores the properties of the mixture to select appropriate separation techniques.

The original design of the process simply used distillation and then n-heptane as an anti-solvent to crystallize the product, which was filtered and dried, as shown in Figure 3. However, there were some disadvantages to heptane, particularly regarding solubility of the impurities that end up as solids with the X-2P product. The challenge for CAMD was to find a suitable replacement mass-separating agent (MSA) for heptane that will cause the X-2P to precipitate out of solution while retaining the toluene and reactants in the liquid phase.


Figure 3: Example 2 process flow

The primary chemistry in this example is

X-2R1 + X-2R2 ---> X-2P + H2O

This takes place at atmospheric conditions in the presence of toluene as the primary solvent, with MTBE carried over from previous processing with X-2R2. Table 6 gives the approximate composition of the post-reaction mixture, for which the process synthesis has been conducted.

Table 6: Example 2 post-reaction mixture composition

Component | Concentration (mol%) | State | Tb (K) | Tm (K)
X-2R1 | 0.1 | Solid | 524.33 | 443
MTBE | | Liquid | 328.35 | 164.55
X-2R2 | 1 | Solid | 445.91 | 300.93
X-2P | 10 | Solid | 611.1 | 353.1
Toluene | 73.9 | Liquid | 383.7 | 178.18
Water | 10 | Liquid | 373.2 | 273.2

State refers to the pure components at 298 K, 1 atm.

As in the original process, the best mechanism for removing the MTBE and water from the post-reaction mixture is concentration of the mixture in at least one exchange of toluene. At that point the mixture is essentially free of water. However, care must be taken not to remove too much toluene, as the product tends to form a highly viscous tar with the toluene at higher concentrations. As a result, the mixture passed to the product isolation step must be at least 30 wt% toluene.

Pre-Design Phase

The primary goal of CAMD in this example is to replace heptane with another MSA to effect the precipitation of the X-2P product while retaining the other materials in the liquid phase. Table 7 lists the solubilities of the compounds in heptane.

Table 7: Solubility of compounds in heptane

Compound | Solubility in heptane (g/cm^3)
X-2R1 | 0.0125
X-2R2 | 0.0186
X-2P | 2.83E-04
Toluene | 0.397

For the CAMD pre-design specification, toluene, X-2R2 and X-2R1 must be miscible in the new solvent, and X-2P must be immiscible. Ideally, the solubility relative to that in heptane will also be greater for the first three and lower for X-2P. Unfortunately, very few of these mixture properties can be predicted due to the complex nature of the solutes. As a first pass, we can use CAMD to find solvents with similar properties to heptane. The target values for this initial CAMD problem are listed in Table 8.

Table 8: Target properties for example 2

Property | Heptane value | Target / Range
Boiling point | 371.6 K | Less than 400 K
Melting point | 182.6 K | Greater than 150 K
Total solubility parameter | 15.2 MPa^0.5 | 14 - 16 MPa^0.5
Groups to search | N/A | CH3, CH2, CH, C, OH, CH3CO, CH2CO, CHO, CH3COO, CH2COO, HCOO, CH3O, CH2O, CH-O, COOH, COO

Also required: low environmental impact; limited health & safety concerns.


CAMD Phase

Based on the specifications above, ProCAMD generates 3498 compounds, filtering out 3397 based on solubility parameter, 14 based on melting point and 44 based on boiling point, leaving 43 candidate compounds. Of these compounds, the following were found in the DIPPR databank: MTBE, Ethylal, Ethyl propyl ether, tert-Butyl ethyl ether, Methyl tert-pentyl ether, Diisopropylether, Acetal, n-Butyl ethyl ether, Di-n-propyl ether, and Ethyl tert-pentyl ether.

Post-Design Phase

Based on discussions with the chemist and available compounds from the stockroom, we decided to explore the use of diisopropylether via experimentation. The chemist was also curious about 2-pentanone, which did not appear on the list because its solubility parameter is closer to 18 MPa^0.5. The solubility information for 2-pentanone and diisopropylether is listed in Table 9 and Table 10, respectively.

Table 9: Solubility of compounds in 2-pentanone

Compound | Solubility in 2-pentanone (g/cm^3) | Relative to heptane solubility
X-2R1 | 0.3043 | 24.3
X-2R2 | 0.1337 | 7.19
X-2P | 1.685E-04 | 0.60
Toluene | 0.126 | 0.32

Table 10: Solubility of compounds in diisopropylether

Compound | Solubility in diisopropylether (g/cm^3) | Relative to heptane solubility
X-2R1 | 0.072 | 5.76
X-2R2 | 0.024 | 1.29
X-2P | 4.82E-05 | 0.17
Toluene | 0.326 | 0.82

With two potential solvents selected at this point, decisions need to be made as to which properties are the most important. In this case, it is most important to reduce the solubility of the X-2P in the solvent. The fact that diisopropylether has less than 20% of the solubility of heptane for X-2P takes precedence over the better solubility for X-2R1 and X-2R2 in 2-pentanone.
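The "relative to heptane" columns and the selection rule stated above can be reproduced directly from the measured solubilities. The following is a minimal sketch using the values from Tables 7, 9 and 10; the function name `relative_to_heptane` and the rule "pick the candidate with the lowest relative X-2P solubility" are a paraphrase of the reasoning in the text, not a published procedure.

```python
# Measured solubilities in g/cm^3, taken from Tables 7, 9 and 10.
heptane = {"X-2R1": 0.0125, "X-2R2": 0.0186, "X-2P": 2.83e-4, "Toluene": 0.397}
candidates = {
    "2-pentanone": {
        "X-2R1": 0.3043, "X-2R2": 0.1337, "X-2P": 1.685e-4, "Toluene": 0.126,
    },
    "diisopropylether": {
        "X-2R1": 0.072, "X-2R2": 0.024, "X-2P": 4.82e-5, "Toluene": 0.326,
    },
}


def relative_to_heptane(solubility):
    """Ratio of each compound's solubility to its solubility in heptane."""
    return {compound: solubility[compound] / heptane[compound] for compound in heptane}


relative = {name: relative_to_heptane(s) for name, s in candidates.items()}

# Most important criterion in the text: lowest solubility of the product X-2P.
best = min(relative, key=lambda name: relative[name]["X-2P"])

for name, ratios in relative.items():
    print(name, {compound: round(ratio, 2) for compound, ratio in ratios.items()})
print("Preferred MSA by X-2P criterion:", best)
```

With these inputs the X-2P ratio is about 0.17 for diisopropylether and about 0.60 for 2-pentanone, so the X-2P criterion selects diisopropylether even though 2-pentanone dissolves the reactants better.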


8.3.3 Example 3: Challenge Problem

From a computer-aided design perspective, this problem has proven difficult. The hope of presenting this problem is to give researchers in this area an idea of the complexities that arise in the real world.

The mixture in this case contains water, acetonitrile, ammonia, and three difficult-to-model internal compounds, as listed in Table 11. The product, X-3P, is an Ammonia-Bromine salt, which impacts any computations on the mixture. The structures of this compound and the other unique compounds are shown in Figure 4. The mixture is highly non-ideal due to the electrolytes present in X-3P and ammonia.

Table 11: Example 3 mixture composition

Compound | Wt. %
Acetonitrile | 51.7
Water | 29.3
Ammonia | 10.3
X-3P (product) | 7.7
X-3R (reactant) | 0.7
X-3B (byproduct) | 0.4

Figure 4: Molecular structures of X-3P, X-3R and X-3B


The goal for the operation is to remove the water (to less than 2 wt%) and to drive the composition of the mixture to approximately 15% X-3P in Acetonitrile. The new solvent should also be a liquid at normal conditions. The suggested properties for such a solvent are listed in Table 12.

Table 12: Target properties for challenge problem

Property | Target / Range
Boiling point | Less than 400 K
Melting point | Greater than 150 K
Miscible with acetonitrile |
Immiscible with water |
Good solvent for X-3P |
Low environmental impact |
Limited health concerns |

First try to find solvents that satisfy all the constraints except those related to the solubility of X-3P. Then use experimental data, if available, to find out which of the candidates have good solubility for X-3P. This will reduce the number of candidates. In the final selection, perform simulation as well as more detailed analysis of the property constraints, especially since the property models used in the design-phase may be subject to errors.

8.4 CONCLUSIONS

This chapter demonstrates the utility of computer-aided molecular design even for complex solutes, where the solvent interaction is difficult to determine analytically. The chapter presents a procedure that combines experimental work with CAMD for complex solutes and then goes on to show how this applies in real situations encountered in the pharmaceutical industry. The final example presents a challenge problem for future computer-aided molecular design researchers.

8.5 REFERENCES

P. M. Harper, "A multi-phase, multi-level framework for computer aided molecular design", PhD-thesis, Technical University of Denmark, Lyngby, Denmark, 2000.

ICAS Documentations, Internal Report PEC02-14, CAPEC, Department of Chemical Engineering, DTU, Lyngby, Denmark, 2002.

C. Jaksland, "Separation process synthesis and design based on thermodynamic insights", PhD-thesis, Technical University of Denmark, Lyngby, Denmark, 1996.

J. Vinson, P. M. Harper, R. Gani, "Solvent selection for chemical and pharmaceutical processes", AIChE Annual Meeting, Paper no. 240 c, Los Angeles, USA, November 2000.

Computer Aided Molecular Design: Theory and Practice L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors) © 2003 Elsevier Science B.V. All rights reserved.

Chapter 9: CAMD for Solvent Selection in Industry - II

J. L. Cordiner


Fine chemicals manufacturing is increasingly looking at reducing the time to market. This means that decisions about the process are pushed further and further back the decision train. These decisions are then required when less and less of the apparently required information is available. Conventional wisdom needs to be tested to consider what information is really needed and what level and quality of decision is required at each stage. In some cases, for example pharmaceuticals, the process route needs to be decided very early for registration reasons. The choice of the route can have large implications on the costs of production and capital requirement. It is then advantageous to have methods to challenge the normal route selection and development processes.

This chapter describes two methods & tools that may be used in early evaluation of processing routes related to solvent selection. These two methods & tools are SMSWIN (developed at Syngenta) and ICAS-tools (developed at CAPEC - http://www.capec.kt.dtu.dk). The methodology applied is briefly described and illustrated through two case studies.

9.2 GENERATING AND REVIEWING ALTERNATIVE PROCESS ROUTES

Clearly the synthetic routes from research (see Fig. 1) are usually not practical for a manufacturing setting. The chemist and engineer need to work together to consider how all the routes for consideration will be operated at the manufacturing scale desired by the business. At this stage it is vital that the early evaluation tools are able to aid this process in generating processes that can be radically different from conventional wisdom. Each chemical route can be operated at a manufacturing scale in a number of different ways and these need to be considered in any route evaluation. In addition, the early evaluation tools are required to enable comparison of routes and processes so that the most practical options can be chosen. Clearly the level of information on each route will be sparse at this stage and therefore the tools must allow quality decisions to be taken on the limited data. It is therefore important to remember that comparison requires the data to be consistent but not necessarily accurate at this stage. As it is important to consider the whole supply chain in route selection, one should use the tools alongside experience from different professionals rather than expecting the tools to do the whole job.

Figure 1: Schematic of the development process for an agrochemical product (Carpenter [1,2]). FF&P = formulation, fill and pack.

9.2.1 Challenges for the Early Evaluation Tools

The early evaluation tools need to be user friendly, robust and easy to use. In particular the tools need to be as intuitive as possible for the infrequent user, minimising the number of forms to be filled in or clicks required. This can be achieved by placing the most commonly used information within easy and fast reach, as shown in Fig. 2.

Wherever possible an expert system to select items or calculation methods needs to be employed in such a way that it is easy for the non specialist to use the tool whilst providing sufficient information (knowledge) about the problem and guidance to arrive at an acceptable solution. For example, the physical property method for early (solvent) evaluation and setting up of this method needs to be made very easy. This can be demonstrated by pre-setting the groups for UNIFAC (described in chapter 2 of Part I) for as many molecules and frequently used building blocks for molecules as possible as is done typically for UNIQUAC for molecules. The databases in SMSWIN, ICAS and in most commercial simulators already have this feature.


Figure 2: Property selection From SMSWIN

Many of the process developers will need help in selecting a property method when considering different solvent options. Here an expert system, as highlighted in the form of a decision tree (see Fig. 3), would be beneficial. This allows rapid selection and points to when further advice from the expert system is required.

The tools should be as visual as possible. Many visualisation tools are provided in SMSWIN and ICAS to help process developers to rapidly assess processing route options. For example, residue maps (for evaluation of feasible separation regions), eutectic/azeotropic diagrams (for evaluation of separation constraints), solubility/saturation plots (for evaluation of phase boundaries) and many more. Having diagrammatic ways of presenting the same data can aid understanding of the solvent-based separation process. For example, a triangular phase diagram highlights the existence of one or two liquid phases in equilibrium with a vapour phase for a ternary mixture consisting of two solutes and a solvent. For the same system, a solvent-free two-dimensional phase diagram can be used to determine (visually) the amount of solvent (or entrainer) needed to break or sufficiently move an azeotrope.


Figure 3: Decision tree for property model selection (Syngenta property method selection)

9.2.2 Facets of Solvent-Based Processing Routes that Need Consideration

As explained in detail in Part-I of this book, there are many issues that need to be considered, from materials handling, where solids as well as fluids need to be handled with issues relating to toxicity, environmental impact, etc., through to the issues of packing and formulation. A variety of tools, for example, databases and expert systems, aid the professional at decision points by having the commonly used information easily at hand. It is useful to have a tool that allows a rapid means of reviewing each research process and assessing the manufacturing process options. The chemist or chemical engineer can carry out a rapid assessment using diagrams and tables, seeing the potential issues easily and quickly.

The reaction chemistry can have a large impact on the choice of solvent-based processing route depending on the yield and selectivity obtained. Typically yield is considered most important given that materials may not be recycled due to complexity of separation. Where recycling is possible, the best route for this needs to be considered and yield is therefore not such an issue. Often, it is possible to significantly alter the known reaction chemistry, in products as well as in reaction rate, via the choice of solvent(s). This is highlighted for the aromatic nucleophilic substitution of 4-fluoronitrobenzene by the azide ion (Cox [3, 4]), where the reaction rate changes by 6 orders of magnitude depending on the solvent used, as shown in Fig. 4. Clearly the choice of solvent is very important.

4-FC6H4NO2 + N3- ---> 4-N3C6H4NO2 + F-

Solvent | ks/kH2O
H2O | 1
MeOH | 1.6
Me2SO | 1.3*10^4
HCONMe2 | 4.3*10^4
(Me2N)3PO | 2.0*10^6

kH2O = 4.0*10^-8 M^-1 s^-1

Figure 4: Effect of solvent on rate constant
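The "6 orders of magnitude" statement can be checked from the relative rate constants. The sketch below assumes the ratios ks/kH2O as read from Figure 4 (1, 1.6, 1.3*10^4, 4.3*10^4 and 2.0*10^6 for H2O, MeOH, Me2SO, HCONMe2 and (Me2N)3PO); the assignment of values to solvents follows the order in which they appear in the figure.

```python
import math

# Relative rate constants ks/kH2O as read from Figure 4 (assumed assignment).
relative_rate = {
    "H2O": 1.0,
    "MeOH": 1.6,
    "Me2SO": 1.3e4,
    "HCONMe2": 4.3e4,
    "(Me2N)3PO": 2.0e6,
}

# Express each ratio as orders of magnitude relative to water.
orders = {solvent: math.log10(ratio) for solvent, ratio in relative_rate.items()}
span = max(orders.values()) - min(orders.values())

for solvent, value in orders.items():
    print(f"{solvent:10s} log10(ks/kH2O) = {value:.2f}")
print(f"Span across solvents: {span:.1f} orders of magnitude")
```

On these numbers the protic solvents (water, methanol) sit within a factor of two of each other, while the dipolar aprotic solvents are four to six orders of magnitude faster, which is the point the text makes about solvent choice.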

The solvent chosen to enhance the reaction rates can potentially be used for the subsequent separations required and/or may need to be separated. In most fine chemical processes there tends to be more than one stage of chemistry required and often a different solvent is chosen for each stage, with initial manufactures requiring purification and separation at each stage. This can mean that a large number of different solvents are used throughout the process, which involves much separation leading to yield losses. Selecting the solvents for each reaction stage carefully to minimise solvent swaps can be very cost effective and can also increase yield. Clearly any tool that aids solvent selection can radically reduce the capital and operating costs of any route. The tools can lower the experimentation time required by reducing the number of solvents to be tested in the laboratory. One can regard using the tool as doing the experiments more quickly. Reducing the experimentation time, and hence aiding faster development, enables more radical processes to be tried. The techniques can then also be used to look at selecting a stage-wide or process-wide solvent rather than having a number of solvents and solvent swaps through the process.

9.2.3 Solvent Selection Methodology in SMSWIN

The solvent selection methodology employed through SMSWIN is based on database search where the solvents are classified according to Chastrette [5]. According to this classification system, a solvent is classified as protic or aprotic. Aprotic solvents are further classified as dipolar or apolar. The Chastrette classification is highlighted for a selected set of solvents from the database of SMSWIN in Fig. 5.

Figure 5: Solvent taxonomy (from SMSWIN)

With SMSWIN, the following steps are performed in solvent selection.

• Quick and initial scan: Here, a quick scan of the database is made with respect to some known target properties of the solvent. A solubility parameter similar to that of the solute is one property that may be used. Boiling points, melting points, molecular weights, etc., may also be used. Finally, the search is limited to a class of solvents (aprotic or protic). The solvents obtained from this search could then be further analysed. It is quite possible, however, if the database is not very large, that this initial scan may not provide any good candidate solvent.

• Detailed search

o Solvents matching specified constraints: Under detailed search, a larger database is used and, while some of the properties such as boiling point and melting point may be retained, the solubility of the solute in the candidate solvents is calculated (rather than implicitly obtained through solubility parameters). All known solvents of a specified class are considered and only those that do not satisfy the specified constraints are rejected. The retained list therefore also includes solvents for which the property models were not available and/or solvents for which the pure component property constraints could not be checked because of missing data in the database. Therefore, at this step, there will be a large list of candidate solvents that will need to be reduced.

o Analysis of scatter graph: In this step, first a scatter graph of productivity index versus percent recovery of the solvent is plotted for all the solvents. In the next step, those solvents that had missing property model parameters, missing data in the database and/or a calculated solubility lower than a specified minimum are removed from the scatter graph. From the remaining solvents, those that belong to a region of potentially promising solvents are retained for the next step (verification of constraints).

o Verification of constraints: The most promising solvents from the last step are verified for specific desired properties of the solvent. For example, in solvents for enhancing reaction rates, ability to cause a liquid-liquid split and no formation of eutectic mixtures are important. The candidate solvents are verified for these properties. Those that remain are further analysed in terms of percentage recovery of the solute and the candidates are ordered in terms of the highest percentage recovery.

• Final selection and analysis: The final selection is made from those candidates that satisfy all constraints. At this level, EH&S constraints are added and a further screening is made. From the remaining solvents, one criterion for selection could be cost and/or availability.
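The steps above can be sketched as a small filtering pipeline. This is an illustration of the described logic only and is not SMSWIN code: the `Candidate` record, the `select` function, the solvent names S1 to S5 and all numerical values are hypothetical placeholders, and the EH&S and cost screening of the final step is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical database record for one solvent."""
    name: str
    solvent_class: str              # "protic", "aprotic dipolar" or "aprotic apolar"
    boiling_point_k: float
    solubility: Optional[float]     # solute mass fraction; None = model or data missing
    recovery_pct: Optional[float]   # percent recovery of the solute; None = not computed


def select(candidates, wanted_class, tb_range, min_solubility):
    """Apply class and boiling point constraints, prune, then rank by recovery."""
    low, high = tb_range
    # Detailed search: reject only candidates that violate a specified constraint.
    kept = [
        c for c in candidates
        if c.solvent_class == wanted_class and low <= c.boiling_point_k <= high
    ]
    # Scatter-graph step: drop missing data and solubility below the minimum.
    kept = [
        c for c in kept
        if c.solubility is not None
        and c.recovery_pct is not None
        and c.solubility >= min_solubility
    ]
    # Verification step: order by highest percentage recovery.
    return sorted(kept, key=lambda c: c.recovery_pct, reverse=True)


# Placeholder data, invented purely to exercise the function.
pool = [
    Candidate("S1", "aprotic dipolar", 420.0, 0.30, 85.0),
    Candidate("S2", "aprotic dipolar", 430.0, None, None),
    Candidate("S3", "aprotic dipolar", 440.0, 0.10, 95.0),
    Candidate("S4", "protic", 425.0, 0.40, 90.0),
    Candidate("S5", "aprotic dipolar", 450.0, 0.35, 92.0),
]

ranked = select(pool, "aprotic dipolar", (418.0, 500.0), 0.27)
ranked_names = [c.name for c in ranked]
print(ranked_names)
```

Keeping records with missing data through the first filter and removing them only in the second mirrors the order described in the text, where an unchecked constraint is not treated as a violated one.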

9.2.4 Solvent Selection Methodology in ICAS

The tool-box in ICAS that performs solvent search/design is called ProCAMD (ICAS Documentations [6]). It is based on the hybrid CAMD method described in chapter 6 of this book. ICAS also has a database where a preliminary search can be made based on the specified pure component property constraints and solubility parameters. Alternatively, the search for solvents from SMSWIN could be used as the pre-design phase for ProCAMD. Once the problem has been specified in terms of property constraints and building blocks to be used, ProCAMD generates and tests candidate solvents. All these solvents satisfy the specified property constraints that can be directly calculated through in-house models in ICAS (ProCAMD). Like SMSWIN, the EH&S property constraints are considered in a separate step (post-design phase of the hybrid CAMD method of chapter 6).

9.3 CASE STUDY

9.3.1 Ni tr ic Acid O x i d a t i o n of A n t h r a c e n e to A n t h r a q u i n o n e

This case study highlights the use of SMSWIN and ICAS-ProCAMD (chapter 6) for the solution of a problem involving process wide solvent selection for the nitric acid oxidation of anthracene to anthraquinone. The techniques looked at solvent effect on reaction, solubilizing the start ing material, solubility of nitric acid, recovery of product, separation scheme for recovery of solvent, recovery of nitric acid, boiling point and melting point, vapour pressure, price, safety, toxicity factors and many more. The solution details are taken from the MEng-thesis of Bavishi [7].

The problem is first solved with SMSWIN and with the generated problem definition information, the solution of the problem has been repeated with ProCAMD.

Solvent Selection Criteria

Based on the processing constraints, the following desired properties for the solvent are needed.

1. Anthracene has to be soluble in the solvent at 145°C. The solubility is approximately 0.27 by mass fraction in the existing solvent at the reaction temperature. So ideally we prefer the new solvent to have solubility greater than that.

2. Recovery of anthraquinone, the product, from the solvent. Ideally we prefer to achieve greater recovery of the product than in the current solvent. We also need to ensure that no eutectic is formed when the product is crystallised.

3. Solubility of Nitric Acid in the solvent needs to be high in order for the instantaneous reaction between the Nitric Acid and Anthracene to take place.

4. Reactivity of the solvent with Nitric Acid, Anthracene and Anthraquinone will need to be known. The solvent in this case is simply a reactant carrier and does not appear in the reaction mechanism. Therefore the solvent should not participate in the reaction.

5. The solvent used needs to be immiscible with water. The process is designed to treat such solvents. Therefore the solvent chosen should form an azeotrope with water, where the liquid splits into two liquid phases with different compositions.

6. The chosen solvent should have a minimum boiling point of 145°C because the reaction temperature is 145°C. At this temperature the solvent should be a liquid for liquid phase reaction.

7. The chosen solvent should have a maximum melting point of 25°C because the product is crystallised at 25°C. This will minimise the chance of the solvent crystallising out with the product.

8. The solvent will be released to the environment via the effluent stream and via vents. Therefore we want a solvent which is environmentally friendly.

9. The solvent used should also be economically favourable. This factor should not be of a great concern as long as a majority of the solvent is being recovered. If the solvent used requires addition of make-up of fresh solvent feed for each batch of reaction, then the cost of the solvent would be a major criterion.

Solvent Selection Using SMSWIN and Published Data

Quick Scan: We need to select a solvent that would not participate in the reaction and would be immiscible with water. Since we want solvents that are similar to (or better than) the known solvent in solubility, we look for similar solvents that do not participate in the reaction. So, we are looking for dipolar aprotic solvents with solubility parameters close to that of anthracene (the solute). That is, we are looking for dipolar aprotic solvents that have solubility parameters < 9.7 (cal/cc)^1/2 and > 8.5 (cal/cc)^1/2. The compounds matching this specification are:

Compound                            Solubility parameter, (cal/cc)^1/2
Acetone                             9.62
2-Butanone                          9.45
Hexamethyl phosphoramide (HMPA)     8.58
Methyl propyl ketone                8.99

Checking the boiling points for the above compounds showed that, except for HMPA, all the compounds have boiling points < 145°C. This means that only HMPA is a suitable solvent. However, since very little data and/or property model parameters are available for this compound, it was also rejected. Consequently, the quick scan did not produce any candidates.
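The quick scan can be pictured as two simple filters applied in sequence. The sketch below is illustrative only: the solubility parameters are the values listed above, while the boiling points are rounded literature values added here as an assumption, and the function name `quick_scan` is hypothetical (it is not part of SMSWIN).

```python
# Quick-scan sketch: keep solvents whose solubility parameter lies inside the
# window and whose normal boiling point exceeds the reaction temperature.
# Boiling points are approximate literature values (an assumption, not data
# from the case study).
CANDIDATES = {
    # name: (solubility parameter in (cal/cc)^1/2, normal boiling point in deg C)
    "Acetone": (9.62, 56.0),
    "2-Butanone": (9.45, 80.0),
    "HMPA": (8.58, 232.0),
    "Methyl propyl ketone": (8.99, 102.0),
}


def quick_scan(candidates, sp_min=8.5, sp_max=9.7, tb_min=145.0):
    """Return the names that pass both the solubility-parameter and boiling-point tests."""
    return [
        name
        for name, (sp, tb) in candidates.items()
        if sp_min < sp < sp_max and tb > tb_min
    ]


survivors = quick_scan(CANDIDATES)
print(survivors)
```

Under these assumed boiling points only HMPA passes both filters, which is then set aside for lack of data as described above.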

Detailed Search: In the next step a detailed search was made within the solvents database of SMSWIN. From all dipolar aprotic solvents, those that had a boiling point > 145°C and a melting point < 25°C were retained, in line with criteria 6 and 7. If a compound did not have these properties in the database, it was not rejected. For all the compounds (solvents) that were retained, the solubility of anthracene was calculated. Again, if the property model parameters were not available, the corresponding solvent was not rejected. This gave a large search space.
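The important detail in this step is that missing data never cause rejection. A minimal sketch of that rule follows, reading the boiling- and melting-point limits as in criteria 6 and 7; the record layout, the helper names and the demo solvents are all hypothetical.

```python
from typing import Callable, Optional


def passes(value: Optional[float], predicate: Callable[[float], bool]) -> bool:
    """A missing property (None) never causes rejection at this stage."""
    return value is None or predicate(value)


def detailed_search(solvents, tb_min=145.0, tm_max=25.0):
    """Keep solvents that satisfy the limits or for which the data are missing."""
    kept = []
    for s in solvents:
        tb_ok = passes(s.get("tb"), lambda t: t > tb_min)
        tm_ok = passes(s.get("tm"), lambda t: t < tm_max)
        if tb_ok and tm_ok:
            kept.append(s["name"])
    return kept


# Hypothetical records (temperatures in deg C), for illustration only.
DEMO = [
    {"name": "solvent A", "tb": 180.0, "tm": -20.0},  # satisfies both limits
    {"name": "solvent B", "tb": 120.0, "tm": -50.0},  # boils too low
    {"name": "solvent C", "tb": None, "tm": 10.0},    # boiling point unknown, kept
    {"name": "solvent D", "tb": 200.0, "tm": 40.0},   # melts too high
]
kept = detailed_search(DEMO)
print(kept)
```

Keeping incomplete records is what makes the search space large at this point; the later scatter-graph step is where records with missing parameters are finally excluded.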

Scatter Graph: In this step, first the productivity index and the % recovery of the solute are calculated. Then the calculated data for all the solvents are plotted (as shown in Fig. 6a). In the next step, the number of solvent candidates is reduced by employing the following constraints:

The solubility of anthracene must be > 0.3 mass fraction at 145°C; only solvents with all property model parameters available are considered; and only solvents with known melting point and boiling point temperatures are considered. This produced a much smaller number of solvent candidates located in a well-defined region (see Fig. 6b).

Figure 6a: Scatter graph before screening of solvent candidates (temperature range 25.0°C to 145°C; horizontal axis: % recovery)

Figure 6b: Scatter graph after screening of solvent candidates (temperature range 25.0°C to 145°C; horizontal axis: % recovery)

Verification of Constraints: The next step has been to verify that each of the candidate solvents from Fig. 6b actually caused a liquid-liquid split and did not form a eutectic mixture.

Final Selection and Analysis: The final candidate solvents are listed in Table 1, together with the % recovery of anthraquinone. Now adding the EH&S constraints listed in Table 2, the final list is reduced to those that are marked in bold letters in Table 1. Out of the five solvents that have satisfied all constraints, tetralin happens to be easily available and is also the cheapest.

Solvent Selection Using ProCAMD

Using the knowledge (information) generated from the use of the solvent search method in SMSWIN, the same problem is formulated in ProCAMD. The search is made separately for acyclic, cyclic and aromatic compounds. Within each molecular class, molecular types may be pre-selected and this in turn will select the building blocks for the CAMD-phase. Figures 7a and 7b show the general problem specification in ProCAMD. It can be noted from Figs. 7a-7b that preselection of molecule types also means automatic selection of the building blocks.

Table 1: List of solvents satisfying all property constraints except EH&S properties.

SOLVENTS                        PERCENTAGE RECOVERY OF ANTHRAQUINONE (%)
Acetophenone                    97.4
Benzyl chloride                 97.1
1-Chloronaphthalene             97.5
4-Chlorotoluene                 97.8
2-Chlorotoluene                 97.8
1,4-Dichlorobutane              97.6
1,4-Dichloro-2-butene, trans    97.4
m-Divinylbenzene                98.2
4-Ethyl-m-xylene                98.0
m-Ethyltoluene                  97.9
1-Heptanal                      98.4
Indane                          97.9
Indene                          98.0
Mesitylene                      97.6
1-Methylindene                  98.2
1-Methylnaphthalene             97.9
1-Methyl-3-n-propylbenzene      98.0
o-Methylstyrene                 98.0
Prehnitol                       97.5
trans-1-Propenylbenzene         98.2
Isopropylbenzene                98.3
n-Propylbenzene                 98.2
Pseudocumene                    97.6
1,1,2,2-Tetrabromoethane        98.0
Tetralin                        98.0
1,2,3,5-Tetramethylbenzene      97.5
1,2,3-Trimethylbenzene          97.6

The property constraints are given in terms of non-temperature pure component properties, temperature dependent pure component properties, mixture properties and azeotrope/miscibility calculations (see figure 7e for more details on the target properties). Note that the solubility parameter is calculated at 298 K. Among the temperature dependent properties, only the vapour pressure is specified > 0 & < 0.0013 (gm 3) at 298 K.

Table 2: EH&S property constraints

CLASS:          H1; H2; M
PROPERTIES:     Very Toxic; Respiratory Sensitisers; Potent Carcinogen; Toxic; Corrosive; Animal Carcinogen; Harmful; Skin/Eye irritants; Non Hazardous; Non Irritant; Non Genotoxic
RELEASE RANGE:  < 0.1 mg/m3; < 0.1 ppm; > 0.1 ppm; < 10 ppm; < 1 mg/m3; < 500 ppm; < 10 mg/m3; > 500 ppm; > 10 mg/m3

Figures 7c and 7d show the specifications for the mixture properties and the azeotrope/miscibility calculations within ProCAMD. The UNIFAC model is selected and anthracene is selected as the solute that needs to be extracted with the solvent (Fig. 7c). From Fig. 7d, it can be noted that an azeotrope with water is specified and a liquid phase split is also specified.

Figure 7e shows a typical screen shot when ProCAMD has finished the calculations in the CAMD-phase. ProCAMD did not find any cyclic compounds (because of the limitations of group parameters within the property models) but it did find acyclic compounds and aromatic compounds, listed in Table 3. One of the compounds, 1-Methyl-3-n-propylbenzene has already been found through SMSWIN (see Table 1). Therefore, the post-design phase was not continued further since the analysis had already been done through SMSWIN.

Table 3: List of feasible compounds from ProCAMD

ACYCLIC SOLVENTS
  n-Decylacetate
  1-Undecanal
  n-Nonylacetate
  1-Decanal
  Methyl decanoate

AROMATIC SOLVENTS
  1,2,3,4-Tetramethylbenzene
  1-Methyl-3-n-propylbenzene

CYCLIC SOLVENTS
  No molecule met the specifications

Figure 7a: General problem specification in ProCAMD

Figure 7b: General problem specification in ProCAMD

Figure 7c: Mixture property specification

Figure 7d: Azeotrope/miscibility calculation specifications

Figure 7e: Screen shot of results from ProCAMD

9.3.2 Case Study 2: Solvent for Dehydration

In this example the problem is to find a solvent to replace toluene as an entrainer in batch dehydration, which is the bottleneck in this stage of a processing route. The existing process operation is carried out by the addition of toluene to a batch distillation column with a decanter to recover entrainer from the distillate. The feed to the system contains a number of products including the intermediate to an agrochemical. The two key components are, however, dimethyl acetamide (DMAC) and water. The other components can be ignored due to their high molecular weight and small impact on the VLE of the water-DMAC system. The current system employs an entrainer because DMAC hydrolyses with water, particularly at elevated temperatures; toluene was therefore selected as an entrainer to allow the separation at lower temperatures. The new solvent would need to fit into the existing equipment with minimal changes required. In addition, the purity of the agrochemical intermediate product stream passing to the next stage of the process should remain the same as with toluene as the entrainer.

The following targets need to be matched by any solvent to be selected.

• Final water content of the intermediate product stream should be less than 9 kmol.

• DMAC losses to be controlled by temperature (< 117°C).

• A maximum of 20 kmol of entrainer can remain with the intermediate product stream.

• Batch dehydration time should decrease in order to reduce cycle time and DMAC losses.

• DMAC loss in distillate should be a maximum of 0.3 kmol%.
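Once batch distillation results are available for a candidate entrainer, the targets above can be checked mechanically. The following is a minimal sketch under stated assumptions: the `BatchResult` container, the function `failed_targets` and all numbers in the demo are hypothetical, and the batch-time target is read as "shorter than the toluene batch time".

```python
from dataclasses import dataclass


@dataclass
class BatchResult:
    """End-of-batch figures for one candidate entrainer (hypothetical container)."""
    water_kmol: float       # water left in the intermediate product stream
    max_temp_c: float       # highest temperature reached during dehydration
    entrainer_kmol: float   # entrainer left with the intermediate product
    dmac_loss_pct: float    # DMAC lost to the distillate, kmol%
    batch_time_h: float     # dehydration time


def failed_targets(result: BatchResult, toluene_time_h: float) -> list:
    """Return the names of the targets that the result misses."""
    checks = {
        "final water content": result.water_kmol < 9.0,
        "temperature limit": result.max_temp_c < 117.0,
        "residual entrainer": result.entrainer_kmol <= 20.0,
        "batch time": result.batch_time_h < toluene_time_h,
        "DMAC loss": result.dmac_loss_pct <= 0.3,
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up numbers, for illustration only.
good = BatchResult(7.5, 110.0, 12.0, 0.2, 9.0)
bad = BatchResult(10.0, 110.0, 25.0, 0.2, 9.0)
print(failed_targets(good, toluene_time_h=12.0))
print(failed_targets(bad, toluene_time_h=12.0))
```

Returning the list of missed targets, rather than a single pass/fail flag, makes it easier to see which operational target a substitute entrainer fails.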

Based on the above targets, the selected entrainer needs to have the following properties. Environmental and toxicity constraints are not considered at this stage but will be analysed in a post-design stage (not highlighted in this case study).

• Form a heterogeneous azeotrope with water with a boiling point below 117°C.

• The liquid-liquid split should be at least as good as with toluene.

• Separation of DMAC and the entrainer should be good, i.e. no azeotrope should form between the entrainer and DMAC, and the solvent power should be high.

Applying the ProCAMD program, the following candidates have been found. Figure 8a shows the screen shot from ProCAMD highlighting the solution details. Figure 8b confirms that the substitute entrainer satisfies the desired (target) properties. The next step would be to perform batch distillation simulations to verify the functional (operational) target properties and to analyse the environmental and toxicity constraints.

Figure 8a: Problem specification details and solution statistics from ProCAMD

Figure 8b: Problem specification details and feasible solvent from ProCAMD

9.4 CONCLUSIONS & FUTURE CHALLENGES

Many of the typical processes contain very complex molecules about which there is little information. These complex molecules have many functional groups and may be in the presence of similar molecules which are produced as by-products or as pre- or post-stage products. Indeed many final molecules are required as a particular enantiomer. Some typical molecules are shown in Fig. 9 (from Carpenter [2]). The selection of the separation task therefore becomes complicated. It is therefore important to have good predictive tools for the important physical properties and the ability to improve these predictions with as much known information as possible. This sort of tool has been developed by the CAPEC Group at the Department of Chemical Engineering of the Technical University of Denmark. There are however ways forward by using as much information as is available from the molecule and similar molecules to give some guidance. This is where using the tools alongside experience and experiment can work very well.

Figure 9: Typical examples of complex molecules (solutes): a substituted diphenyl ether used as a herbicide; a green azo dyestuff for dyeing polyester; a synthetic pyrethroid insecticide.

It is common in many processes to have by-products and intermediates that are very similar in structure to the product; indeed it is also common to have enantiomers where one is the active compound and all other enantiomers are inactive. This makes the separation selection and also the prediction of the properties more difficult. Measurement of the required physical properties can also be problematic due to the difficulty of producing a pure sample of any by-product. There is therefore a substantial gap in the currently available property prediction methods to be filled.

The currently available CAMD methods and tools (see Part I of this book) need to be further developed to take account of wider solvent issues and could also be widened to route selection including formulation of active products, for example, surfactant selection. In addition, visualisation tools along with optimisation that allow selection of separation schemes taking into account efficiency of separation (Bek-Pedersen et al. [8]) will prove very useful. Solvent selection tools will also be greatly improved when reaction effects are better predicted. Finally, early evaluation tools are proving very useful in improving solvent-based process route selection practice, bringing chemical engineers and chemists together and facilitating co-current development that is focussed much earlier, reducing the necessary experimentation and development time-scales.

ACKNOWLEDGEMENTS

Permission to publish from Syngenta is gratefully acknowledged. Thanks to a great many friends and colleagues for advice and information, especially: Dr Keith Carpenter, Dr Alan Hall and Dr Will Wood of Syngenta Technology and Projects, and James Morrison, Consultant.


1. K.J. Carpenter, "Chemical Engineering in Product Development - The Application of Engineering Science", Entropic, 223 (2000).

2. K.J. Carpenter, 16th International Symposium on Reaction Engineering (ISCRE 16), 2001.

3. B.G. Cox, "Modern liquid phase kinetics", Oxford Chemistry Primer Series 21, Oxford University Press, UK (1994).

4. B.G. Cox and A.J. Parker, J. Am. Chem. Soc., 95 (1973) 408.

5. Chastrette, JACS, 107 (1985) 1-11.

6. ICAS Documentations, Internal Report PEC02-14, CAPEC, Department of Chemical Engineering, DTU, Lyngby, Denmark, 2002.

7. P. Bavishi, MEng Thesis-2000, Department of Chemical Engineering, Imperial College, London, UK (2000).

8. Bek-Pedersen, E., Gani, R., Levaux, O., Computers and Chemical Engineering, 24 (2000) 253-259.


Chapter 10: Case Study in Optimal Solvent Design

M. Sinha, L. E. K. Achenie & G. M. Ostrovsky


Solvents are extensively used as a major component of ink in the printing industry. The function of solvents in ink is to act as a vehicle for polymeric resins, pigments and dyes. The ink solvent also assists in wetting and dispersion of dyes and pigments. In letterpress and offset lithographic printing processes, the ink is carried to the plate by means of a train of rubber rollers commonly called "blankets", as shown in Fig. 1. Thus a thin film of ink is distributed over a large surface area on the blankets. These ink solvents are volatile and evaporate to leave behind the pigments and resins on the blanket surface. Cleaning is required whenever the residue build-up affects the print quality and between print jobs. Paper fibres, ink residue, paper coating and dried ink are types of material that must be removed from the rubber blankets.

Figure 1: Schematic of Lithographic Printing

One of the most used solvents in lithographic printing is the "blanket wash", which is specially formulated to clean ink and other residue from rubber blankets. Blanket cleaning is accomplished automatically or manually. In an automatic blanket wash process, as shown in Fig. 1, the blanket wash is jet sprayed onto the blanket. Therefore a large amount of the wash is lost by evaporation even before it makes contact with the blanket.

Blanket wash solvents are mostly solvent mixtures as opposed to single component solvents. As such, next to solvent performance, one of the most pressing concerns of the printing industry with regard to the environment is the volatile organic component (VOC) level of solvents. At present the VOC levels of solvents used in the printing industry are unusually high, well over 80% and far beyond the industry target of 30%. For example, a commonly used blanket wash, "VM&P naphtha" has a 100% VOC content (United States Environmental Protection Agency, 1997a).

Blanket washes and solvents for "rag and bucket" operations are chosen based on their performance and their impact on the environment, health and safety. There is a wide variation in the performance attributes of cleaning solvents by different vendors. To enhance the cleaning operation, companies sometimes mix solvents from different vendors. However, this trial and error approach is costly and may not necessarily yield the solvent mixture with the desired performance attributes. In addition, the solvent for a cleaning operation may not meet safety, health and environmental restrictions.

Another important issue is minimizing the effect of a solvent on the surface characteristics of the rubber blanket by inducing swelling. Swelling severely affects the print quality in lithographic processes. Thus, there is a need to account for this in blanket wash design.

The goal of this case study is to design globally optimal solvents to be used for cleaning in lithographic printing. These solvents should (i) have a minimal drying time, (ii) dissolve residue ink, (iii) not swell the blanket, and (iv) be environmentally benign. Drying time is correlated with the heat of vaporization of the solvent. The ink residue is assumed to consist of phenolic resins.

10.2 PROBLEM DEFINITION

The problem as posed can be modelled as a multicriteria optimization problem. However, in the printing industry, there are rather loose and minimal requirements on these attributes. Therefore these attributes are regarded as constraints with given targets (similar to goal programming, Tamiz, 1996). A straightforward approach to modelling the problem as a special kind of multicriteria problem is to consider a lumped objective in which the different criteria appear as terms with appropriate weights. However this approach forces the solvent formulation engineer to think of appropriate weights (usually of no physical meaning) to employ, a rather non-trivial task. A more meaningful and rigorous approach is to consider the problem as a multi-level optimization problem. The latter is rather difficult to solve and has usually been restricted to bi-level optimization problems in which the decision variables are continuous.

We reiterate that the goal of this case study is to design optimal solvents to be used as cleaning agents in the printing industry. These solvents should (i) have a minimal drying time, (ii) dissolve residue ink, (iii) not swell the blanket, and (iv) be environmentally benign. Drying time is correlated with the heat of vaporization of the solvent.

The ink residue is assumed to consist of phenolic resins. Solvents that can effectively dissolve the ink residue obey the solute-solvent interaction

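$$R^2 = 4(\delta_D - \delta_D^*)^2 + (\delta_P - \delta_P^*)^2 + (\delta_H - \delta_H^*)^2 \le (R^*)^2$$

where $\delta_D^* = 23.3$, $\delta_P^* = 6.6$, $\delta_H^* = 8.3$ and $R^* = 19.8$, and $\delta_D$, $\delta_P$, $\delta_H$ are determined from a model, for example a group contribution model (see Table 1).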
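The solute-solvent interaction test is a weighted distance in the three-component solubility parameter space. A small sketch follows, taking the resin values 23.3, 6.6 and 8.3 and the radius 19.8 as used in the solvent power constraint of the formulation; the function names are hypothetical and the solvent values in the demo are illustrative numbers, not results from the case study.

```python
def interaction_distance_sq(d, p, h, d_ref=23.3, p_ref=6.6, h_ref=8.3):
    """Squared solute-solvent distance with the dispersion term weighted by 4."""
    return 4.0 * (d - d_ref) ** 2 + (p - p_ref) ** 2 + (h - h_ref) ** 2


def dissolves(d, p, h, radius=19.8):
    """True when the solvent lies inside the interaction sphere of the resin."""
    return interaction_distance_sq(d, p, h) <= radius ** 2


# Illustrative solvent components (dispersion, polar, hydrogen bonding).
r2 = interaction_distance_sq(16.0, 9.0, 5.1)
print(round(r2, 2), dissolves(16.0, 9.0, 5.1))
print(dissolves(10.0, 0.0, 0.0))
```

The factor 4 on the dispersion term means that a mismatch in the dispersion component is penalised more heavily than an equal mismatch in the polar or hydrogen-bonding components.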

The heat of vaporization, boiling point and melting point of the solvent are calculated using the Constantinou and Gani (1994) method. The fragment-based method is used to calculate Kow (Lyman et al., 1981). The group contribution parameters for solubility parameter calculation are based on van Krevelen and Hoftyzer's method (Barton, 1985). The models and their references are summarized in Table 1.

Table 1. Property Prediction Models for CAMD_1 and CAMD_2

Property                             Reference
Solubility Parameter                 Barton, 1985
Boiling Point                        Constantinou and Gani, 1994
Melting Point                        Constantinou and Gani, 1994
Heat of Vaporization                 Constantinou and Gani, 1994
Partition Coefficient (log Kow)      Lyman et al., 1981
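To show why a logarithmic group contribution model still leads to a simple bound on the group sum, the sketch below assumes the first-order Constantinou and Gani form, T_b = 204.359 ln(sum of group contributions), which is how the constant 204.4 in the boiling-point constraint is usually read. The contribution values in `DEMO_CONTRIB` are placeholders and must be replaced by the published tables; the function names are hypothetical.

```python
import math

TB_SCALE = 204.359  # K, first-order constant (rounded to 204.4 in the text)


def boiling_point(group_counts, contributions):
    """T_b in K from a weighted sum of group contributions (assumed first-order form)."""
    total = sum(n * contributions[g] for g, n in group_counts.items())
    return TB_SCALE * math.log(total)


def min_group_sum(tb_min):
    """Equivalent linear bound: T_b >= tb_min  <=>  group sum >= exp(tb_min / TB_SCALE)."""
    return math.exp(tb_min / TB_SCALE)


# Placeholder contributions, NOT the published values.
DEMO_CONTRIB = {"CH3": 0.9, "CH2": 0.9, "CH3CO": 3.6}
ketone = {"CH3": 1, "CH2": 1, "CH3CO": 1}

tb = boiling_point(ketone, DEMO_CONTRIB)
group_sum = sum(n * DEMO_CONTRIB[g] for g, n in ketone.items())
print(round(tb, 1), group_sum >= min_group_sum(323.0))
```

Because the logarithm is monotonic, a temperature limit translates into a limit on a quantity that is linear in the structural variables, which is convenient for the MILP relaxations used later.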

We note that the nonlinear property prediction constraints ($\varphi_j$ in PMD) do not employ the $z_{ijp}$ and $w_i$ variables from the Churi-Achenie octet rule implementation (see Chapter 3). Thus the problem is nonlinear with respect to only the $u_{ik}$ variables. In the property prediction models, the nonlinearities are present in all the $u_{ik}$ variables. The estimators for the case study are constructed in the appendix. These estimators are then used in the proposed branch and bound technique to solve the problem.

10.2.1 Case Study CAMD_1

In this case study, structural feasibility constraints are employed to ensure feasible molecular structures. For simplicity introduce the notation

$$\psi_1 = \psi_H, \quad \psi_2 = \psi_P, \quad \psi_3 = \psi_V, \quad \psi_4 = \psi_D$$

The resulting molecular design formulation is shown below.

CAMD_1:

$$\min \sum_i \sum_j u_{ij} (\Delta H_v)_j \tag{1}$$

subject to

$$\sum_i \sum_j u_{ij} \le n_{max} \tag{2}$$

$$\sum_i \sum_j u_{ij} (2 - v_j) = 2 \tag{3}$$

$$\exp\Big(\big(\sum_i \sum_j u_{ij} (T_b)_j\big) / 204.4\Big) > 323 \tag{4}$$

$$\exp\Big(\big(\sum_i \sum_j u_{ij} (T_m)_j\big) / 102.425\Big) < 223 \tag{5}$$

$$\sum_i \sum_j u_{ij} (Z)_j + \sum_i \sum_j u_{ij} (Z')_j < 4.0 \tag{6}$$

$$4(\delta_D - 23.3)^2 + (\delta_P - 6.6)^2 + (\delta_H - 8.3)^2 < (19.8)^2 \tag{7}$$

$$\psi_P - 6.3\,\psi_V \ge 0 \tag{8}$$

$$\underline{\psi}_i \le \psi_i \le \overline{\psi}_i, \quad i = 1, 2, 3, 4 \tag{9}$$
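The two structural constraints, Eqs. (2) and (3), can be checked for any list of groups. The sketch below assumes the valences implied by the group notation (one free bond for a terminal group, two for a chain group); the dictionary `VALENCE` and the function name are hypothetical, and only acyclic molecules are covered.

```python
# Number of free bonds per group, read from the group notation (an assumption).
VALENCE = {
    "CH3": 1, "CH2": 2, "OH": 1,
    "CH3CO": 1, "CH2CO": 2, "COOH": 1,
    "CH3COO": 1, "CH2COO": 2, "CH3O": 1, "CH2O": 2,
}


def is_structurally_feasible(groups, n_max):
    """Eq. (2): at most n_max groups. Eq. (3): sum of (2 - valence) equals 2."""
    size_ok = len(groups) <= n_max
    octet_ok = sum(2 - VALENCE[g] for g in groups) == 2
    return size_ok and octet_ok


mek = ["CH3", "CH2", "CH3CO"]               # CH3-CH2-CO-CH3
propanol = ["CH3", "CH2", "CH2", "OH"]      # CH3-CH2-CH2-OH
fragment = ["CH3", "CH2"]                   # one dangling bond left

print(is_structurally_feasible(mek, n_max=5))
print(is_structurally_feasible(propanol, n_max=5))
print(is_structurally_feasible(fragment, n_max=5))
```

Each terminal group contributes 1 to the sum and each chain group contributes 0, so the condition simply says that an acyclic chain has exactly two ends.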

To solve CAMD_1 we proceed as follows.

Step 1:

(a) Decide on the set of groups to be used to form compounds. We choose as basis set twelve groups, namely CH3-, CH2-, Ar-, -Ar-, -OH, CH3CO-, -CH2CO-, -COOH, CH3COO-, -CH2COO-, -CH3O, and -CH2O-.

(b) Identify the design variables. These are given by the structural variables $u_{ij}$, which determine whether a particular structural group is present in the molecule.

Step 2: Identify the performance objective. The performance objective is given by the double summation in Eq. (1), which gives the heat of vaporization of the compound.

Step 3: Identify the constraints. Constraints are employed in order to ensure that the last seven groups in the basis set are not allowed to occur more than twice in a compound, as follows:

$$\sum_i u_{ij} \le 2, \quad j = 5, \ldots, 12$$

The constraint $\delta_P \ge 6.3$ will ensure minimal blanket swelling. The environmental impact of solvents is accounted for by requiring that the maximum value of the partition coefficient (log Kow) be 4.0. To ensure that the solvent is a liquid at ambient temperature, limits on the boiling point ($T_b$) and melting point ($T_m$) are imposed. The constraints are Eqs. (4) through (9). Eqs. (4) to (7) are the property target constraints (boiling point, melting point, partition coefficient and solvent power), Eq. (8) is the blanket swelling constraint written in terms of the branching functions, and Eq. (9) gives simple bounds on the branching functions.

Step 4: Decide whether to use the Odele-Macchietto or the Churi-Achenie Octet Rule Model. Here we employ the much simpler (although restrictive) Odele-Macchietto model for acyclic compounds, where $v_j$ is the valence of the jth structural group. The model is given in Eq. (3). We also include the molecular structural constraints (Eqs. (2) and (3)).

Step 5: Using information from previous steps, assemble the mathematical program, i.e. the performance objective, constraints, design variables and the Octet Rule Model. Eqs. (1) through (9) make up the mathematical program.

Step 6: Construct linear estimators of the performance objective and the constraints. The simple example in Chapter 3 gives an illustration of how to do this; also see the appendix in this chapter.

Step 7: Enter an iterative loop using the branch and bound (BB) procedure in Section 3.3.1 of Chapter 3. There are two nonconvex constraints. The splitting functions employed are $\psi_D$, $\psi_P$, $\psi_H$ and $\psi_V$. The MILP solver used is a public domain code, lp_solve by Hartmut Schwab, available at (ftp.es.ele.tue.nl/pub/lp_solve). This solver uses the simplex algorithm together with a rather simple depth-first strategy. Identify the optimal molecule using information from the solution.

Six different runs were investigated for case study 1. The runs correspond to $n_{max}$ of 3, 4, 5, 6, 7, and 10 (CAMD_1a, CAMD_1b, CAMD_1c, CAMD_1d, CAMD_1e, and CAMD_1f, respectively). The corresponding problem dimensions are 36, 48, 60, 72, 84 and 120. For all cases the number of constraints is 15. The termination criterion used is an absolute tolerance of $10^{-3}$. The results are shown in Table 2.
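The quoted problem dimensions follow directly from the basis set: with twelve groups, each of the n_max positions carries twelve structural variables. A short check (the variable names are ours):

```python
N_GROUPS = 12  # size of the basis set chosen in Step 1(a)

# n_max for each run of case study 1
RUNS = {"CAMD_1a": 3, "CAMD_1b": 4, "CAMD_1c": 5,
        "CAMD_1d": 6, "CAMD_1e": 7, "CAMD_1f": 10}

# One structural variable per (position, group) pair.
dimensions = {case: n_max * N_GROUPS for case, n_max in RUNS.items()}
print(dimensions)
```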

Problem CAMD_1a has a very limited search space. A feasible solution was found in the first iteration of the branch-and-bound algorithm. In CAMD_1c, the algorithm took 31 iterations and 351.4 seconds on a 333-MHz DELL Pentium II personal computer. The maximum number of sub-regions constructed is 16. The globally optimal solution corresponded to methyl-ethyl ketone (MEK or CH3-CH2-CO-CH3) with objective function 35.471 kJ/mole. This compound was found at the 10th iteration with a valid upper bound of 35.471 and a lower bound of 33.99. Since the difference between the upper and lower bound was more than the tolerance, the algorithm continued executing. The algorithm finally converged to MEK as the global solution after 21 more iterations. The two other feasible compounds found were propanol (CH3-CH2-CH2-OH) and diethyl-ketone (CH3-CH2-CO-CH2-CH3). The objective function values for propanol and diethyl-ketone were 44.77 kJ/mole and 40.12 kJ/mole, respectively.
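The reason the algorithm keeps running after MEK is first found is the termination test on the bound gap. A minimal sketch of that test, assuming the absolute tolerance is 1e-3 and using the bounds reported at the 10th iteration; the function names are hypothetical.

```python
def bound_gap(upper, lower):
    """Difference between the best feasible value and the best lower bound."""
    return upper - lower


def has_converged(upper, lower, abs_tol=1e-3):
    """Stop only when the incumbent is proven optimal to within abs_tol."""
    return bound_gap(upper, lower) <= abs_tol


# Bounds reported at the 10th iteration of CAMD_1c (kJ/mole).
print(bound_gap(35.471, 33.99), has_converged(35.471, 33.99))
```

Finding the optimal compound and proving that it is optimal are separate events: the remaining iterations only raise the lower bound until the gap closes.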

Table 2: Application of Reduced Space BB algorithm to CAMD_1

Case      nmax  Variables  Constraints  Iterations  CPU time (min)  Max number of subregions
CAMD_1a   3     36         15           1           0.045           1
CAMD_1b   4     48         15           18          0.86            12
CAMD_1c   5     60         15           31          5.85            16
CAMD_1d   6     72         15           42          17.21           20
CAMD_1e   7     84         15           46          48.45           21
CAMD_1f   10    120        15           67          713.5           21

We note that at any iteration, the solution of the relaxed MILP problem is a structurally feasible compound since all the structural constraints are linear. During the execution of the algorithm, fifteen different compounds were found. Of these, two other compounds satisfied the specified or performance constraints. For case CAMD_1e, the number of iterations is 46 and 3 compounds are designed. The maximum number of subregions created is 21. In CAMD_1f, the number of iterations is 67. The maximum number of subregions created is 21. Even though the number of iterations does not grow very much, the CPU time increases. This is because the CPU time associated with each LP solution increases significantly when the number of variables increases. Another desirable property of this algorithm is that a very small number of subregions are created.

For the three cases CAMD_1c, CAMD_1e and CAMD_1f, the numbers of subregions created are 16, 21 and 21, respectively. Thus the algorithm is very efficient in terms of storage requirements. It should be noted that as the dimension of the problem increases from 60 to 120, the number of iterations only increases from 31 to 67. This is perhaps the consequence of the fact that the number of branching variables, namely 4, is the same in all the cases.

Recall that in all the example problems above, although the number of variables $u_{ij}$ increased from 60 to 120, the number of branching functions is unchanged at 4. In contrast, if we employ the standard full space BB algorithm, we will need to perform branching with respect to all the variables $u_{ij}$. In that case, the number of branching variables ranges from 60 to 120.

10.2.2 Case study CAMD_2

In this case, the same formulation is solved with the Churi-Achenie model (see Step 4 above). The connectivity variables z and w are employed in the structural representation as described in section 3 of chapter 3. The second constraint in CAMD_1 is replaced by the following set of structural constraints. This leads to a large increase in the number of linear structural constraints.

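$$\sum_{p=1}^{n_{max}} \sum_{j=1}^{s_{max}} z_{ijp} = \sum_{k=1}^{m} u_{ik} v_k, \quad i = 1, \ldots, n_{max} \tag{10}$$

$$\sum_{p=1}^{i-1} \sum_{j=1}^{s_{max}} z_{ijp} \ge 1 - w_i, \quad i = 2, \ldots, n_{max} \tag{11}$$

$$\sum_{i=1}^{n_{max}} \sum_{k=1}^{m} u_{ik} + \sum_{i=1}^{n_{max}} w_i = n_{max} \tag{12}$$

$$w_1 = 0 \tag{13}$$

$$w_i \le w_{i+1}, \quad i = 1, \ldots, (n_{max} - 1) \tag{14}$$

$$\sum_{j=v_k+1}^{s_{max}} \sum_{p=1}^{n_{max}} z_{ijp} + M u_{ik} \le M, \quad i = 1, \ldots, n_{max}, \; k = 1, \ldots, m \tag{15}$$

$$\sum_{j=1}^{s_{max}} z_{ijp} \le 1, \quad i = 1, \ldots, (n_{max} - 1), \; p = (i + 1), \ldots, n_{max} \tag{16}$$

$$\sum_{p=1}^{n_{max}} z_{ijp} \le 1, \quad i = 1, \ldots, n_{max}, \; j = 1, \ldots, s_{max} \tag{17}$$

$$\sum_{k=1}^{m} u_{ik} - \sum_{k=1}^{m} u_{i-1,k} \le 0, \quad i = 2, \ldots, n_{max} \tag{18}$$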
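The occupancy part of this constraint set, Eqs. (12) to (14) and (18), only involves the group variables u and the empty-position indicators w, and can be checked without the connectivity variables z. The sketch below reads those four constraints as: every position is either filled or empty, the first position is filled, and empty positions are pushed to the end. The function name and the demo data are hypothetical.

```python
def occupancy_ok(u, w):
    """Check the occupancy constraints on u (n_max x m, 0/1) and w (n_max, 0/1)."""
    n_max = len(w)
    filled = [sum(row) for row in u]  # number of groups placed at each position
    return (
        w[0] == 0                                                    # Eq. (13)
        and all(w[i] <= w[i + 1] for i in range(n_max - 1))          # Eq. (14)
        and sum(filled) + sum(w) == n_max                            # Eq. (12)
        and all(filled[i] - filled[i - 1] <= 0 for i in range(1, n_max))  # Eq. (18)
    )


# n_max = 4 positions, m = 3 groups: three positions filled, the last one empty.
u_good = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
w_good = [0, 0, 0, 1]

# Same molecule size, but with the empty position in the middle.
u_bad = [[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 1]]
w_bad = [0, 1, 0, 0]

print(occupancy_ok(u_good, w_good), occupancy_ok(u_bad, w_bad))
```

Forcing empty positions to the end removes equivalent permutations of the same molecule from the search space.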

This formulation is solved for $n_{max}$ equal to 3, 4 and 5 (CAMD_2a, CAMD_2b, and CAMD_2c). The numbers of search variables are 57, 89 and 115, respectively. The corresponding numbers of constraints are 67, 89 and 113. Note that the formulation is nonlinear with respect to only the $u_{ik}$ variables. The results are summarized in Table 3.

The number of variables that participate in the nonlinear terms is the dimension of the $u_{ik}$ variables. The remaining variables determine the connectivity information and appear only in the linear terms in CAMD_2. The dimensions of the vector of variables $u_{ik}$ in the three runs are 36, 48 and 60 (CAMD_2a, CAMD_2b and CAMD_2c).

Table 3: Application of Reduced-Space BB algorithm to problem CAMD_2

Case      nmax  Variables  Constraints  Iterations  CPU time (min)  Max number of subregions
CAMD_2a   3     57         67           1           0.1             1
CAMD_2b   4     89         89           18          3.36            9
CAMD_2c   5     115        113          22          14.5            11

CAMD_2a corresponds to a problem with a reduced search space restricted by $n_{max} = 3$. For this case the global optimal solution was found in only one iteration. When the search space was increased to 89 and 115 variables, the number of iterations also increased to 18 and 22. For the run CAMD_2c, one of the feasible compounds found in an intermediate step is -CH2O-CH2COO-CH2O-CH2-, a cyclic compound. The structural constraints used in case study 2 allow the design of cyclic compounds. The constraints in case study 1 are restricted to only acyclic compounds.

For about the same number of variables, the number of iterations in case study 2 (CAMD_2) is relatively smaller than in case study 1. In addition, the maximum number of nodes generated in case study 2 is much smaller than in case study 1. This can be attributed to the fact that in CAMD_2 the number of variables appearing in nonlinear terms is much smaller compared to problems of similar dimension in CAMD_1.

10.2.3 Case Study CAMD_3

In this case study, solvents are designed with entirely different criteria. Here the most desirable attribute of the solvent is recoverability. That is, after the blanket wash operation is performed, the solvent compounds that evaporate are recovered by a solvent recovery system. This case study attempts to find a solvent compound that will be least expensive to recover. Many competing solvent recovery techniques can be applied, namely condensation, gas adsorption and gas absorption. Here the recovery system is restricted to condensation.

A typical condensation recovery system consists of a compressor that takes in the solvent-laden exhaust gases from printing (via the ventilation system) and compresses them to a higher pressure. These high-pressure gases are passed through a condenser that cools the stream, which is then flashed to recover the solvent. The details of the recovery operation have been discussed elsewhere (Sinha, 1999). Here the objective is to find the solvent compound with the minimal total annualized cost (TAC) of recovery. In addition to the functions in (6.4), we use the following branching functions:

$$\psi_5 = H_{v0} + \sum_i \sum_j u_{ij} H_{vj}, \quad \psi_6 = V_0 + \sum_i \sum_j u_{ij} V_j, \quad \psi_7 = P_{comp}, \quad \psi_8 = T_{cond}.$$

The CAMD_3 case study with recovery considerations is:

minTAC=85675*(Ps176163 I v o " V M

subject to:

$$\sum_i \sum_j u_{ij} \le 4 \qquad (19)$$

$$\sum_i \sum_j u_{ij}\,(2 - v_j) = 2 \qquad (20)$$

$$\log_{10}(V_m) - \log_{10}(P'_{comp}) - 2.7\,(T_b / T_{cond})^{1.7} < -11.47 \qquad (21)$$

$$\exp\Big(\Big(\sum_i \sum_j u_{ij}\, t_{mj}\Big) / 102.425\Big) < 223 \qquad (22)$$

$$500 < T_{b0} + \sum_i \sum_j u_{ij} T_{bj} < 700 \qquad (23)$$

$$20 < H_{v0} + \sum_i \sum_j u_{ij} H_{vj} < 80 \qquad (24)$$

~_~_u~j(Z~ + ~_,Euu(Z') j < 4 . 0 i j i j

(25)

$$4(\delta_D - 23.3)^2 + (\delta_P - 6.6)^2 + (\delta_H - 8.3)^2 < (19.8)^2 \qquad (26)$$


'g~ - 6.3Vv -> 0 (27)

$$\underline{\psi}_i \le \psi_i \le \overline{\psi}_i, \quad i = 1, \ldots, 8 \qquad (28)$$

where Eq. (20) is Odele's octet rule implementation; Eqs. (21) to (27) are the recovery, melting point, boiling point, heat of vaporization, octanol-water partition, solvent power and swelling constraints, respectively; and Eq. (28) represents the constraints imposed by the branching functions.
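As a small illustration of the octet rule in Eq. (20), the sketch below evaluates the sum of (2 - valence) over a list of selected groups. The valence table and the helper names are assumptions introduced here for illustration; only the rule itself (the sum must equal 2) comes from the text.

```python
# Assumed valences (number of free bonds) for a few groups of the basis.
# This table and the function names are illustrative, not taken from the chapter.
VALENCE = {"CH3-": 1, "-CH2-": 2, "-OH": 1, "-CH2COO-": 2, "-CH2NH2": 1}


def octet_lhs(groups):
    """Left-hand side of Eq. (20): sum over selected groups of (2 - valence)."""
    return sum(2 - VALENCE[g] for g in groups)


def satisfies_octet_rule(groups):
    """True when the groups can close into one acyclic molecule (sum equals 2)."""
    return octet_lhs(groups) == 2


# The compound reported as globally optimal, CH3-(CH2COO)2-CH2NH2, as a group list.
reported_optimum = ["CH3-", "-CH2COO-", "-CH2COO-", "-CH2NH2"]
ethanol = ["CH3-", "-CH2-", "-OH"]
# Only chain-extending groups: no end groups, so the sum is 0 instead of 2.
no_end_groups = ["-CH2-", "-CH2COO-"]

print(satisfies_octet_rule(reported_optimum), octet_lhs(no_end_groups))
```

A group list made only of two-valent groups gives a sum of 0, so it cannot satisfy Eq. (20) as an acyclic structure.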

The following modified basis with 15 groups is used in this study: [CH3-, CH2-, -OH, CH3CO-, -CH2CO-, -COOH, CH3COO-, -CH2COO-, -CH3O, -CH2O-, CH2=CH-, -CH=CH-, -CH2NH2, =CHNH2, CH3NH-]. The aromatic groups are removed and some nitrogen-containing groups are added so that amines and other nitrogen compounds can be designed.

There are a total of 8 splitting functions. The last 4 are used to construct linear underestimators for the objective function and an underestimator and an overestimator for the recovery constraint. The construction of estimators is discussed in the attached appendix. This case study has 60 variables and 3 nonlinear constraints. Moreover, the objective function is nonlinear. The condenser temperature can range between 150 K and 298 K, which results in poor scaling and causes difficulty during optimization. To overcome this, we have scaled the condenser temperature to lie between 0.1 and 0.9 such that Tcond = 185T' + 131.5, where T' is the scaled condenser temperature.
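The scaling can be checked numerically. The sketch below assumes the linear map that sends the scaled range [0.1, 0.9] onto the physical range [150 K, 298 K]; the function names are illustrative.

```python
def to_kelvin(t_scaled):
    """Map the scaled condenser temperature T' to Tcond in K (assumed linear map)."""
    return 185.0 * t_scaled + 131.5


def to_scaled(t_cond):
    """Inverse map from Tcond in K to the scaled variable T'."""
    return (t_cond - 131.5) / 185.0


# End points of the scaled range reproduce the physical bounds.
print(to_kelvin(0.1), to_kelvin(0.9))
# Scaled value of the condenser temperature reported for the optimal compound.
print(to_scaled(288.75))
```

With this map the optimization variable stays within an order of magnitude of the binary variables, which is the stated purpose of the scaling.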

The globally optimal compound designed is a diester with the structure CH3-(CH2COO)2-CH2NH2. The recovery cost associated with this compound is $25,981. The corresponding compressor pressure is 2 atm and the condenser temperature is 288.75 K. The algorithm took 56 iterations and a CPU time of 41.6 seconds. At termination, the number of nodes (i.e. subregions) is 20.

The above problem was solved again with the local optimization software DICOPT in GAMS (Brooke, 1996). The optimal compound found by DICOPT is HO-CH2COO-CH3NH, an ester. The objective function value associated with this compound is 106,327, the compressor pressure is 10 atm and the condenser temperature is 298 K. The extra effort associated with global optimization is therefore justified: it reduces the recovery cost roughly fourfold.

10.3 DISCUSSION AND CONCLUSIONS

The molecular design problem is reduced to solving an MINLP problem in which the number of binary variables uij can range from several tens to several hundreds. The use of the standard branch-and-bound method for solving the problem can be computationally intensive since all the variables uij must be used as branching variables. To overcome this problem, we have proposed a new strategy. The main idea is to branch on branching functions instead of on all the search variables. This approach decreases the number of branching variables in our molecular design framework. For example, in case study 1, a problem with 120 nonlinear variables is solved with just 4 splitting variables. The case studies also demonstrate this in terms of memory: the maximum number of nodes stored during the search is 21 for CAMD_1e and CAMD_1f and 20 for CAMD_3.

In other words, during branch and bound, the bounding operation is performed in the space of the search variables, while the branching operation is performed in a reduced-dimension space defined by the branching (or splitting) functions.

The branching functions are determined from the special tree function representation of both the objective function and the constraints. To construct the corresponding linear underestimators, we employed the sweep method developed in our earlier research (Sinha, Achenie and Ostrovsky, 1999; Ostrovsky, Achenie and Sinha, 2000).

The proposed algorithm scales well. Specifically, as the problem size increases, the computational effort increases almost linearly. We anticipate that this linear behavior will be exhibited also in large molecular design models.


See Chapter 3

10.5 APPENDIX: Construction of Estimators

One very important property of a solvent is its ability to dissolve the solute. Solute-solvent interaction is often characterized by the Hansen solubility parameter $\delta_T$ (Archer 1996). This parameter is characterized by three intermolecular interactions, namely the hydrogen bonding interaction ($\delta_H$), the polar interaction ($\delta_P$) and the nonpolar (dispersive) interaction ($\delta_D$) (Hansen, 1971). The mathematical expression for the solvent selection criterion based on the Hansen solubility parameter is

$$R^{ij} = 4(\delta_D - \delta_D^*)^2 + (\delta_P - \delta_P^*)^2 + (\delta_H - \delta_H^*)^2 < (R^*)^2 \qquad (29)$$

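$$\delta_D = \frac{\psi_D}{\psi_V}, \quad \delta_P = \frac{\sqrt{\psi_P}}{\psi_V}, \quad \delta_H = \sqrt{\frac{\psi_H}{\psi_V}} \qquad (30)$$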


where

$$\psi_D = \sum_i \sum_j u_{ij} F_{Dj}; \quad \psi_P = \sum_i \sum_j u_{ij} (F_{Pj})^2; \quad \psi_H = \sum_i \sum_j u_{ij} (-U_{Hj}); \quad \psi_V = V_0 + \sum_i \sum_j u_{ij} V_j \qquad (31)$$

Here $F_{Dj}$, $F_{Pj}$ and $U_{Hj}$ are the group contribution parameters associated with the dispersion, polar and hydrogen bonding solubility parameters, respectively (Barton 1985). Substituting Eq. (30) into Eq. (29) we obtain

(32)

Note that the nonconvex Hansen solubility design criteria make the solvent design problem multiextremal. $R^*$ is the interaction radius associated with the solute. The distance between the solute and solvent solubility parameters is $R^{ij}$ and can be computed as shown in Eq. (29). $\psi_V$ is the molar volume of the solvent. We will now construct linear underestimators for this important constraint. Eq. (32) is made up of four separable terms. The first and the fourth terms are squares of linear equations. The second and the third terms are relatively more complicated. Using Eq. (30) one can obtain the STF representation of the third term in the form

~? - ( ~ ) ~ ,

<P~ = ~H and ~p~ =~v

Here the functions ~p~,<p~, <p~ , <p~, <p[, (p~ corresponds to the fifth, third and second

levels respectively.

The first level functions are the variables uik. The branching set branching functions contain ~p~,(p~,~p~,V,,V/p,,gv,~. It is easy to see tha t the first three

functions are expressed as functions of the variables ~ , ~p, ~ , , ~ . Therefore the

functions ~H, ~p, ~ , , ~ can be used as branching functions. Let us consider for

illustration the construction of an underestimator for the third term in Eq. (32). First we need to find bounds for <p~, <p~ , ~p~, <p~, ~p~. Since we know the bounds for uik we can determine the bounds for ~p~, <p~. The ranges of the functions <p~, ~p~ are used to construct the ranges of ~p~ and <p~ at the third level. These are then used to construct the range of <p~, at the fourth level. Thus the bounds are estimated in a bottom-up sweep.


The l inear unde res t ima to r s are constructed in a reverse sweep t h a t s t a r t s at the fifth level and goes down. First , a l inear unde re s t ima to r of ~p~ is const ructed in

t e rms of ~p( as follows L[~p~ ;S 4 ] = p~ (~p()+ P2- Here S r - ((p(,~p(). The sign of pl,

which depends on the in terva l (~4,~4), de te rmines whe the r the function p~(~p()

is convex or concave wi th respect to variables tp~ and ~p~. The u n d e r e s t i m a t o r

now has the following form

The signs of (pa-#4) and ~ a +p4) determine if the corresponding functions are concave or convex. Subsequent ly , the unde res t ima to r is const ructed wi th respect to 1VH and ~y. After r ea r r ang ing the terms, the l inear unde re s t ima to r is

r ep resen ted as L[qg~ ,S 2 ] = P8~, + P9 (~v)+ P~o.

We reiterate that the subregion is not defined in terms of the search variables uik, but rather in terms of functions of uik. Based on the region S, the coefficients p1, (p3 - p4) and (p3 + p4) are calculated and a decision about the construction of the underestimator is made at two levels (Ostrovsky, Achenie and Sinha, 2000). This makes the algebraic structure of the underestimator adaptive.


Chapter 11: CAMD in Solvent Mixture Design

M. Sinha & L. E. K. Achenie


Solvents are extensively used in industry to clean equipment parts by separating grease and grime, to suspend solids as in inks and paints, to separate a solid or liquid component from a mixture prior to purification (liquid-liquid extraction and gas absorption), and for many other purposes. Once the media of choice for the processing industry, many organic solvents are being phased out of products and processes for environmental and health reasons (Krishner 1995). One of the most used solvents in a printing press is the "blanket wash", which is specially formulated to clean ink from lithographic printing presses. There are more than 52,000 lithographic printers in the United States (Adrian 1991) and each uses 160 gallons per year (a total of approximately 8 million gallons per year). These solvents are eventually released into the atmosphere, thus posing considerable environmental problems. There is a tremendous need to replace and/or recover these solvents.

The search for new solvents is usually based on design heuristics, prior experience and direct experimental studies. This approach is inherently trial and error; it is therefore costly and time-consuming, and it may not yield a solvent with the desired performance attributes. For example, the solvent mixture may dry too slowly for the intended use. While there is no substitute for experimental study, there is a definite need for a pre-experimental stage that will quickly and cheaply generate new solvents that are promising enough to be considered for the costly and time-consuming experimental stage.

In a recent article Zhao and Cabezas (1998) discussed different property requirements that should be satisfied in order to design solvents for different applications. In our discussions with the Hartford Courant (Hartford Courant), a major performance issue in the selection of a blanket wash solvent was minimizing the effect of the solvent on the surface characteristics of the rubber blanket on which the printing paper is processed. Many solvents swell the rubber blanket. Environmental restrictions and the need to reduce operating costs mandate recycling of the spent solvent in the short term. Previous and existing industrial approaches to solvent selection and substitution have relied on database search and query approaches. For example, SAGE (Center for Aerosol Technology 1993) and SolvDB (National Center of Manufacturing Science) have large databases of existing solvents and associated processes. Through query and answer sessions, the user is led to a suggested selection of solvents. Eastman Chemicals and Dow Chemicals have developed a large database of solvents. The database contains solvent properties such as flash point, thermodynamic functions, vapor pressure and hazard evaluation. Eastman's solvent alternative strategy aims at matching the solubility constants and the evaporation rates (Krishner 1995).

Solvents used in industry are often blended together to meet user requirements. These may include limits on boiling point, viscosity and other transport properties, solute-solvent interaction or the solvent power characterized by its solubility parameters. Other implicit requirements on miscibility have to be satisfied to ensure single-phase mixtures.

Computer Aided Product Design (CAPD) has emerged as a powerful strategy for identifying promising compounds with pre-specified levels of certain thermophysical properties. More formally, CAPD is a reverse engineering procedure that incorporates desired levels of physico-chemical properties directly into the design of products. This approach has been applied to polymer design (Venkatasubramanian and Chan 1995); reinforced polymer composite design (Vaidyanathan and El-Halwagi 1994; Vaidyanathan and El-Halwagi 1996; Vaidyanathan et al. 1998); liquid-liquid extractants (Gani and Brignole 1983; Naser and Fournier 1991); and refrigerants (Duvedi 1995; Duvedi and Achenie 1995; Joback 1989). CAPD has been applied to solvent design problems as well. These include the design of solvents for liquid-liquid extraction (Macchietto et al. 1990; Odele and Macchietto 1993) and gas absorption processes (Pistikopoulos and Stefanis 1998), solvents for separation processes (Pretel et al. 1994) and solvent blends for paint formulation (Klein et al. 1992). Computer-aided mixture design (commonly referred to as the formulation problem) has been applied to solvent design for liquid-liquid extraction (Gani and Brignole, 1983; Macchietto et al., 1990; Odele and Macchietto, 1993), refrigerant mixture design (Duvedi and Achenie, 1997), polymer blend design (Vaidyanathan and El-Halwagi, 1996), coating applications (Dunn et al., 1997) and in the paint and ink industry.

A generalized approach for designing mixtures appears in articles by Klein, Wu et al. and by Duvedi and Achenie (Duvedi and Achenie, 1997; Klein et al., 1992). Many mixing rules for prediction of mixture properties have been developed and reviewed by Horvath (Horvath, 1992).

Even though many water-soluble solvents exist that can be used to make a blanket wash formulation, deciding on the composition of the final wash formulation is a trial and error procedure. Moreover, mixture property prediction for aqueous systems is difficult and the models are highly nonlinear (Reid et al., 1987; Wu, 1987). Thus there is a big incentive to develop a formulation tool that can design aqueous blanket wash blends in the presence of nonlinear (and probably nonconvex) models. In this study, we employ our interval arithmetic based global optimization package LIBRA for the systematic design of optimal water-based blanket wash systems.

The chapter is organized as follows. Section 11.2.1 describes the use of lithographic blanket washes. The mixture design model is developed in Section 11.2.2, followed by a description of interval analysis in Section 11.3. Interval analysis forms the basis for the solution algorithm in Section 11.4. A case study is presented in Section 11.5. Sections 11.6 and 11.7 present conclusions and bibliography. Finally, the appendix (Section 11.8) gives details of the case study as well as the physical property models employed.

11.2 PROBLEM DEFINITION

11.2.1 Lithographic Blanket Washes

Lithographic printing is the most common printing process and is based on the immiscibility of water and oil. Printing ink, which is insoluble in water, comprises resins and pigments suspended in a petroleum-based solvent. The ink is applied on a printing plate, which is pressed onto printing paper to impart the print image. When the plate is dipped in aqueous fountain solution, the ink and fountain solution repel each other, and the ink is confined to the image area of both the plate and the printed material. Roller blankets are used to carry the print paper. At the end of the print cycle, ink residues on the blanket have to be removed using a solvent-based blanket wash solution.

Solvents are extensively used as a major component of ink in the printing industry. The function of a solvent in ink is to act as a vehicle for polymeric resins, pigments and dyes. The ink solvent also assists in wetting and dispersion of dyes and pigments. In letterpress and offset lithographic printing processes, the ink is carried to the plate by means of a train of rubber rollers commonly called "blankets", as shown in Fig. 1 in Chapter 10. Thus a thin film of ink is distributed over a large surface area on the blankets. These ink solvents are volatile and evaporate to leave behind the pigments and resins on the blanket surface. Cleaning is required whenever the residue buildup affects the print quality and between print jobs. Paper fibers, ink residue, paper coating and dried ink are types of material that must be removed from the rubber blankets. One of the most used solvents for lithographic printing is the "blanket wash", which is specially formulated to clean ink and other residue from rubber blankets.


The manual cleaning operation (also termed "rag and bucket") involves wiping down the blanket cylinder with a cloth wipe dampened with blanket wash solution. The large volume of soiled rags from these operations is routinely sent to industrial launderers, who are then faced with the proper disposal of the wastewater resulting from laundering the rags. In addition, the industrial launderers are saddled with the inefficiencies of solvent use in the printing industry, since they also have to abide by rigid standards on wastewater pollution levels.

Blanket wash solvents are mostly mixtures as opposed to single-component solvents. Next to solvent performance, one of the most pressing environmental concerns of the printing industry is the volatile organic compound (VOC) level of solvents. At present the VOC levels of solvents used in the printing industry are unusually high, well over 80% and far beyond the industry target of 30%. For example, a commonly used blanket wash, "VM&P naphtha", has a 100% VOC content (United States Environmental Protection Agency, 1997a). Another important issue is minimizing the effect of a solvent on the surface characteristics of the rubber blanket, in particular swelling. Swelling severely affects the print quality in lithographic processes. To enhance the cleaning operation, companies sometimes mix solvents from different vendors. However, as noted earlier, this trial and error approach is costly and may not yield a solvent mixture with the desired performance attributes. In addition, the solvent for a cleaning operation may not meet safety, health and environmental restrictions.

11.2.2 Mixture Design Problem Formulation

The computer-aided mixture design is composed of three main steps:

(i) Selection of pure components from a database (for example, designing a binary mixture from a set of 10 pure components can result in $\binom{10}{2}$ or 45 combinations),

(ii) Determining the mixture composition that satisfies the property targets, and

(iii) Ranking the candidate mixtures by some criteria such as overall cost.

The first step is a combinatorial problem; the second step is a continuous problem, which could be nonconvex depending on the nature of the property prediction techniques employed. We propose to use a mixed-integer nonlinear problem formulation that is general enough to handle several types of mixture estimation techniques. Obviously, if the number of combinations (binary, ternary, etc.) of pure components is small, one can enumerate them all and solve a series of continuous nonlinear programs.
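For small basis sets, the enumeration mentioned above is easy to sketch. The snippet below counts the binary combinations from a basis of 10 components and lists all blends with up to three components; the component labels and the helper name are hypothetical, and the continuous composition subproblem that would be solved for each subset is not shown.

```python
from itertools import combinations
from math import comb


def candidate_subsets(components, n_max):
    """Yield every subset containing between 1 and n_max pure components."""
    for size in range(1, n_max + 1):
        yield from combinations(components, size)


# Hypothetical labels for a basis set of 10 pure-component solvents.
basis = [f"S{i}" for i in range(1, 11)]

binary_blends = list(combinations(basis, 2))
up_to_ternary = list(candidate_subsets(basis, 3))

print(len(binary_blends), comb(10, 2))  # number of binary combinations
print(len(up_to_ternary))               # singles + binaries + ternaries
```

The count grows combinatorially with the basis size and with the number of components allowed in the blend, which is the motivation for the mixed-integer formulation.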

In the proposed formulation, binary variables are used to denote the presence or absence of a pure-component solvent in the mixture, and a set of continuous variables is used to describe the mole fractions of the components in the mixture. Hence the formulation is mixed-integer in nature. First let us introduce the variables

$y_i$ (binary variable) = 1 if pure component i is present in the mixture, 0 otherwise

$x_i$ (continuous variable between 0 and 1) = mole fraction of pure component i in the mixture

Other parameters include:

$n$ = number of pure component solvents (basis set)
$n_{max}$ = maximum number of pure component solvents in the blend
$P_{ij}$ = property j of pure component i.

Constraints are imposed for (a) limiting the number of pure component solvents in the blend; (b) ensuring that the mole fraction of an absent component is 0; and (c) ensuring that all the mole fractions add to 1.0. These constraints are by no means exhaustive, and several different ones can be added to achieve a specific solvent mixture design objective. An equivalent mathematical programming model is as follows

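$$P_{mix}: \quad \min_{x,y} f(x, y)$$

subject to:

$$
\begin{aligned}
& p^L \le P(x, y) \le p^U \\
& \sum_i y_i \le n_{max} \\
& \sum_i x_i = 1 \\
& 0 \le x_i \le y_i \\
& x = [x_1, x_2, \ldots, x_n]^T \\
& y = [y_1, y_2, \ldots, y_n]^T \\
& x_i \in [0, 1] \ \text{real} \\
& y_i \in \{0, 1\} \ \text{binary}
\end{aligned}
\qquad (1)
$$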


$p^L$ and $p^U$ are lower and upper limits on a vector of target properties P. These properties may be nonlinear and nonconvex with respect to the search variables xi, yi.

The constraint $0 \le x_i \le y_i$ in the above formulation ensures that if component i is not present in the mixture (i.e. $y_i = 0$), then the corresponding composition xi is also 0. This, however, can lead to cases where the composition of a component that is present is infinitesimally small. To avoid this we replace it by $y_i\,\varepsilon \le x_i \le y_i\,(1 - \varepsilon)$, where $\varepsilon$ is a small number (e.g. 0.01).
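The structural constraints of the mixture model can be checked for a candidate blend with a few lines of code. This is a minimal sketch under stated assumptions: the function name and tolerance are illustrative, the property bounds on P(x, y) are omitted because they depend on the property models, and the tightened bounds with a small number eps (0.01 here) replace the plain link between mole fraction and binary variable.

```python
def is_structurally_feasible(x, y, n_max, eps=0.01, tol=1e-9):
    """Check the cardinality, mole-fraction sum and linking constraints.

    x: mole fractions, y: 0/1 presence flags, n_max: max components in blend.
    Property constraints are deliberately left out of this sketch.
    """
    if len(x) != len(y):
        return False
    if any(yi not in (0, 1) for yi in y):
        return False
    if sum(y) > n_max:                      # at most n_max components
        return False
    if abs(sum(x) - 1.0) > tol:             # mole fractions add to 1
        return False
    for xi, yi in zip(x, y):                # y_i*eps <= x_i <= y_i*(1 - eps)
        if not (yi * eps - tol <= xi <= yi * (1.0 - eps) + tol):
            return False
    return True


print(is_structurally_feasible([0.3, 0.7, 0.0], [1, 1, 0], n_max=2))
print(is_structurally_feasible([0.995, 0.005, 0.0], [1, 1, 0], n_max=2))
```

Note one consequence of the tightened bounds: a blend with a single component at mole fraction 1 violates the upper bound 1 - eps, so pure solvents would need separate treatment under this sketch.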

11.3 DESCRIPTION OF THE PROPOSED METHOD OF SOLUTION

Since many property estimation techniques are generally nonconvex, we have developed an interval analysis based optimization strategy that can design (globally) optimal mixtures. Interval analysis has emerged as a reliable mathematical tool that can automatically generate lower and upper bounds for a function (Hansen, 1992). It has been used for solving ordinary differential equations, linear systems, and verifying chaos. Interval arithmetic, which is at the heart of interval analysis, was developed by Moore (Moore, 1966).

In essence, interval analysis based optimization continually deletes portions of the search space with the goal of maintaining a final box of desired width that contains the global solution. A number of interval-based optimization procedures have been developed (e.g. Hansen, 1992; Moore et al., 1992; Ratschek and Rokne, 1991; Vaidyanathan and El-Halwagi, 1994; Van Iwaarden, 1996). Most of these procedures are tailored for unconstrained optimization problems. In addition, these techniques can only handle continuous variables; they do not handle discrete variables. Notwithstanding their attractive features, interval-based global optimization methods are in general computationally intensive. To address some of these issues, we have developed new acceleration strategies and extended the capabilities of the algorithm to solve mixed-integer problems.

11.3.1 Brief Introduction to Interval Analysis

An interval, Xi = [ai, bi], containing a real variable xi is characterized by two real scalars ai and bi such that $a_i \le x_i \le b_i$. An interval vector $X = (X_1, X_2, \ldots, X_i, \ldots, X_n)^T$ represents a hyper-rectangular region in an n-dimensional space $R^n$ and is referred to as a hyperbox or simply a box. The width of a box, denoted by w(X), may be defined as the largest width among the component intervals Xi. There are other definitions for the width.

An interval extension F(X) of a continuous function f(x) is obtained by replacing all variables with their interval counterparts. The resulting function is called an interval function and is given as $F(X) = [F^L, F^U]$, where $F^L$ and $F^U$ are loose lower and upper bounds on f(x) over the box (domain) X (see Figure 2). In general, the smaller the width of X, the tighter the bounds are. The true range of a function f(x) over X is denoted by f(X) (or R(X)) $= [f^L, f^U]$, where $f^L$ and $f^U$ correspond to the global minimum and maximum over X. Moore (Moore, 1979) proved that

$$\lim_{w(X) \to 0} F(X) = f(X)$$

Considerable effort has been expended by interval analysts to produce systematic methods for representing an interval function that gives the sharpest bounds on the range of a real function over an interval (see for example Ratschek and Rokne, 1984, Neumaier, 1990, and Rokne, 1986). It can be shown that for monotonic functions F(X) = f(X).

An important property of interval analysis that makes it useful for global optimization is the inclusion of functions. Consider a real-valued function f(x). The interval function F(X) is said to be an inclusion isotone of f(x) if $x \in X$ implies $f(x) \in F(X)$ and also if $Y \subseteq X$ implies that $F(Y) \subseteq F(X)$. Operations such as addition, subtraction, multiplication and division have been developed that are inclusion isotonic. For more details on the operations and interval analysis see (Hansen, 1992; Moore, 1966; Ratschek and Rokne, 1984).
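The ideas of a natural interval extension and inclusion isotonicity can be illustrated with a few lines of interval arithmetic. The class below is a minimal sketch written for this illustration: it supports only addition, subtraction and multiplication and ignores the outward rounding that a rigorous interval library would apply. The test function f(x) = x*x - x is an arbitrary choice.

```python
class Interval:
    """Closed interval [lo, hi]; rounding errors are ignored in this sketch."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def lift(value):
        """Treat a plain number as a degenerate interval."""
        return value if isinstance(value, Interval) else Interval(value)

    def __add__(self, other):
        o = Interval.lift(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, other):
        o = Interval.lift(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __mul__(self, other):
        o = Interval.lift(other)
        products = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(products), max(products))

    def width(self):
        return self.hi - self.lo

    def contains(self, other):
        o = Interval.lift(other)
        return self.lo <= o.lo and o.hi <= self.hi


def f(x):
    """Works on floats (real function) and on Intervals (natural extension)."""
    return x * x - x


X = Interval(0.0, 1.0)
Y = Interval(0.4, 0.6)          # Y is a sub-box of X
FX, FY = f(X), f(Y)

print((FX.lo, FX.hi), (FY.lo, FY.hi))
```

On X = [0, 1] the true range of f is [-0.25, 0], while the natural extension gives the looser enclosure [-1, 1]; on the sub-box Y the enclosure is narrower and is contained in F(X).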

Figure 2: Continuous function and its interval extension

We summarize some of the notation above as

$x_i$ = a real scalar number (e.g. 5.0)
$x$ = a vector of real scalar numbers $x_i$ (e.g. $(1.0, 1.4, 2.5)^T$)
$X_i$ = an interval scalar number (e.g. [2.0, 2.2])
$X$ = a vector of interval scalar numbers $X_i$ (e.g. $[(1.0, 1.2), (1.4, 1.45), (2.5, 2.6)]^T$)
$f(X)$ (or $R(X)$) = $[f^L(X), f^U(X)]$ = the range of a function f(x) on X (this is often unknown)
$F(X)$ = $[F^L(X), F^U(X)]$ = the natural interval extension of a function f(x) on X, such that $F^L(X) \le f^L(X) \le f^U(X) \le F^U(X)$.

11.3.2 Global Optimization Methods Based on Interval Analysis

Almost all interval analysis based global optimization algorithms (Ichida and Fuji, 1979; Moore et al., 1992; Vaidyanathan and El-Halwagi, 1994) employ a successive domain reduction approach that eliminates portions of the search region that do not contain the global solution. Consider the continuous optimization model

$$
\begin{aligned}
(\text{globally}) \ \min_x \ & f(x) \\
\text{subject to:} \ & g(x) \le 0 \\
& h(x) = 0 \\
& x = [x_1, x_2, \ldots, x_n]^T \ \text{real} \\
& x \in X^0
\end{aligned}
\qquad (2)
$$

Almost all domain reduction algorithms use the following tests to systematically remove portions of $X^0$ that cannot contain the global minimum.

i) Upper Bound Test
If the objective value UPBD = f(x) corresponding to a point x in the feasible region (i.e. the search space that satisfies all constraints) is known, then any sub-region X of $X^0$ satisfying $F^L(X) > UPBD$ does not contain the global solution and can be deleted from the search space.

ii) Infeasibility Test
For a sub-region X, if $G^L(X) > 0$ then X does not contain any feasible region and can be deleted from the search space (domain).

iii) Monotonicity Test
For an unconstrained problem, at the optimal point the gradient f'(x) = 0. Let F'(X) be the interval extension of f'(x). Now if

$$0 \notin F'(X)$$

then the sub-region cannot contain an optimal point; only the edge of the sub-region is retained and the rest is deleted. Note that this test is not appropriate for constrained systems; it can only be applied to a sub-region X for which $G^U(X) \le 0$ and $H(X) = [H^L(X), H^U(X)] = [0, 0]$.

iv) Non-convexity Test
This test is based on the principle that at the optimal point the curvature of the surface defined by the objective function is nonnegative; in other words, the Hessian (second partial derivative matrix) of f(x) is positive semi-definite. Let $\tilde{H}(X)$ be the interval Hessian of the function. To apply this test one checks whether $\tilde{H}(X)$ can be positive semi-definite, typically by checking, for each i = 1, 2, ..., n, whether the diagonal element $\tilde{H}_{ii}(X) < 0$. If any diagonal element is strictly negative over X, the Hessian cannot be positive semi-definite anywhere in X and the sub-region is deleted. Again, this test is not appropriate for constrained systems; it can only be applied to a sub-region X for which $G^U(X) \le 0$ and $H(X) = [H^L(X), H^U(X)] = [0, 0]$.

v) Distrust Region Test
Vaidyanathan and El-Halwagi (Vaidyanathan and El-Halwagi, 1994) proposed this test for constrained problems. Here the idea is that once an infeasible point, x, is found, a box X is constructed around it such that $G^L(X) > 0$. Then the boxed region X is deleted from the search space. The authors did not directly deal with equality constraints.
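The first three tests can be illustrated on a one-dimensional toy problem. The sketch below assumes the problem min f(x) = x*x - 2x subject to g(x) = 0.5 - x <= 0 on the initial box [-2, 3]; this problem, the hand-written interval extensions and the function names are all chosen for illustration and are not taken from the chapter.

```python
def sq_range(lo, hi):
    """Exact range of x**2 over [lo, hi]."""
    ends = (lo * lo, hi * hi)
    return (0.0 if lo <= 0.0 <= hi else min(ends), max(ends))


def F(lo, hi):
    """Interval extension of the objective f(x) = x**2 - 2x."""
    s_lo, s_hi = sq_range(lo, hi)
    return (s_lo - 2.0 * hi, s_hi - 2.0 * lo)


def G(lo, hi):
    """Interval extension of the constraint g(x) = 0.5 - x (feasible if g <= 0)."""
    return (0.5 - hi, 0.5 - lo)


def dF(lo, hi):
    """Interval extension of the derivative f'(x) = 2x - 2."""
    return (2.0 * lo - 2.0, 2.0 * hi - 2.0)


def reason_to_delete(box, upbd):
    """Return the name of the test that removes the box, or None if it survives."""
    lo, hi = box
    if G(lo, hi)[0] > 0.0:                       # infeasibility test: G_L > 0
        return "infeasible"
    if F(lo, hi)[0] > upbd:                      # upper bound test: F_L > UPBD
        return "upper bound"
    d_lo, d_hi = dF(lo, hi)
    fully_feasible = G(lo, hi)[1] <= 0.0         # G_U <= 0 over the whole box
    if fully_feasible and not (d_lo <= 0.0 <= d_hi):
        return "monotonic"                       # in practice the edge is retained
    return None


upbd = 1.5 * 1.5 - 2.0 * 1.5                     # f at the feasible point x = 1.5
for box in [(-2.0, 0.0), (2.5, 3.0), (1.5, 2.0), (0.75, 1.25)]:
    print(box, reason_to_delete(box, upbd))
```

Only the box (0.75, 1.25), which contains the constrained minimizer x = 1, survives all three tests. A full algorithm would keep the boundary of a box removed by the monotonicity test, as the text notes.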

11.3.3. Modifications Employed in this Study

An interval-based global optimization algorithm can be constructed based on the above tests. However, in our experience it is computationally slow especially for problems with a large number of constraints. Additional domain reduction tests are proposed next.

Upper Bound via SQP Local Optimization

For the initial search space defined by $X^0$, a good upper bound on the global solution, fUPBD, is found using a local optimization method, Sequential (or Successive) Quadratic Programming (SQP) (Bazaraa and Shetty, 1979). SQP has proven to be a very powerful algorithm for gradient-based local optimization. It often requires fewer function evaluations than other competing gradient-based algorithms (Biegler et al., 1997).

Subsequently at any iteration, k, the upper bound (UPBDk) for a sub-region Xk is found via SQP. If this upper bound is lower than the overall upper bound UPBD, then UPBDk replaces UPBD. In our experience, in many cases SQP finds the global solution in the first few iterations and the remaining iterations are merely used to verify global optimality.


Local Feasibility Test

Here the idea is to relax the optimization model, consider only the convex constraints, and determine if this relaxed search space contains a feasible solution. This requires the prior specification of which constraints are convex and which are not, which is not always straightforward. However, linear equality and linear inequality constraints are simple convex constraints. Based on this reduced set of constraints, the feasibility of a sub-region Xk is checked by solving the following feasibility problem

$$
\begin{aligned}
\min_{\mu, x} \ & \mu \\
& g_{convex}(x) \le \mu \\
& h_{convex}(x) = 0 \\
& x \in X_k
\end{aligned}
\qquad (3)
$$

The above problem is solved via a local optimization algorithm (SQP). Note that for this problem the local and global solutions are identical. If the problem is infeasible (i.e. $\mu > 0$) then Xk cannot have any feasible point and can be deleted from the search space.

Extension to MINLP Problems

Mixture design problems have relatively small dimension. For a design with a basis set of m pure components the interval dimension is 2m. As indicated earlier, current interval based global optimization algorithms can only solve continuous optimization problems. An extension of the algorithm is required to solve mixed integer nonlinear programs (MINLP) such as the mixture design problem discussed earlier.

Let a hyperbox be represented by $[x_i^L, x_i^U] \ \forall\, i = 1, 2, \ldots, n$. This box can be partitioned into two sub-boxes at a branch point $x_k^*$. The first box is represented as $[x_i^L, x_i^U] \ \forall\, i \ne k$ and $[x_k^L, x_k^*]$, and the second box is represented as $[x_i^L, x_i^U] \ \forall\, i \ne k$ and $[x_k^*, x_k^U]$. We have employed a modified partitioning strategy for binary variables. If variable k is a binary variable, then the partitioning results in the two points 0 and 1 on the k-th dimension. For the above case a binary partition will result in two boxes, namely first box = $[x_i^L, x_i^U] \ \forall\, i \ne k$ and [0, 0], and second box = $[x_i^L, x_i^U] \ \forall\, i \ne k$ and [1, 1] (see Figure 3).


Figure 3: Branching Strategy on continuous and binary variables

The branching strategy at each iteration plays an important role in the efficiency of this algorithm. Accuracy of bounds of functions and constraints using interval analysis also depends on the width of X. To address both of these issues, branching is performed along the dimension that corresponds to the maximum width.
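The partitioning rule can be sketched as follows. This is an illustrative sketch, not the LIBRA implementation: the function name and the box representation (a list of (lower, upper) pairs) are assumptions, and the branch point for a continuous variable is taken to be the midpoint, which the text does not specify.

```python
def split_box(box, binary_indices=frozenset()):
    """Split a box along its widest dimension.

    box: list of (lower, upper) pairs, one per variable.
    binary_indices: positions of binary variables, which split into the
    points [0, 0] and [1, 1] instead of two half-intervals.
    Returns (k, first_box, second_box), where k is the branching dimension.
    """
    widths = [hi - lo for lo, hi in box]
    k = max(range(len(box)), key=lambda i: widths[i])   # widest dimension
    lo, hi = box[k]
    if k in binary_indices:
        first_k, second_k = (0.0, 0.0), (1.0, 1.0)
    else:
        mid = 0.5 * (lo + hi)                            # assumed branch point
        first_k, second_k = (lo, mid), (mid, hi)
    first = box[:k] + [first_k] + box[k + 1:]
    second = box[:k] + [second_k] + box[k + 1:]
    return k, first, second


# One mole fraction x1 in [0, 0.4] and one binary variable y1 in [0, 1].
box = [(0.0, 0.4), (0.0, 1.0)]
k, first, second = split_box(box, binary_indices={1})
print(k, first, second)

# Once y1 is fixed, its width is zero and the next split is on x1.
k2, first2, second2 = split_box(first, binary_indices={1})
print(k2, first2, second2)
```

Because an unfixed binary variable has width 1, the maximum-width rule tends to branch on binary variables before narrow continuous ones; after the split their width is zero and they are not selected again.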

11.4 STEP-BY-STEP ALGORITHM FOR THE SOLUTION TECHNIQUE

Here the implementation of a global optimization algorithm (LIBRA) that utilizes the domain reduction strategy is presented. In our implementation we employ the interval arithmetic C++ class definitions, data structures and basic linear algebra operations from the public domain software VerGO (http://www-math.cudenver.edu/~rvan/VerGO/VerGO.html) developed by Van Iwaarden (1996).


In the local optimization solver (a C++ implementation of Biegler and Cuthrell's SQP algorithm (Biegler and Cuthrell, 1985)) the gradient and Hessian information for all functions and constraints is computed via the automatic differentiation package ADOL-C (Griewank et al., 1996). For this, the function and constraints have to be supplied in a defined format. Automatic differentiation is also used for the interval functions and constraints. The algorithm is implemented in the following steps and is also shown in Figure 4.

STEP 1: In this step the input data is prepared. The problem is specified, including the dimension, the number of inequality and equality constraints, and the indices of the binary variables (if the problem is an MINLP). Bounds on the variables (in other words the original box, entered as an interval vector $X^0$) are specified and define the original search space. Two lists are initialized: box_list (containing the search space) and good_list (containing the candidates for the global solution). A tolerance ($\varepsilon$) is specified; it is the maximum acceptable width of the final solution box. Finally, the original region $X^0$ is inserted as the only box in the box_list.

STEP 2: Check if the original region $X^0$ is feasible. If not, the algorithm terminates with an infeasible solution.

STEP 3: Initialize the lower bound (LWBD) to -infinity and the upper bound (UPBD) to the objective value corresponding to the feasible local solution from SQP. If a local solution is not found then set UPBD to +infinity.

STEP 4: Check if box_list is empty. If yes, go to STEP 8, else go to STEP 5.

STEP 5: Remove the box $X_k$ from the top of the box_list. This box corresponds to the box with the maximum width. Find the upper bound $UPBD_k$ for this box via SQP such that $UPBD_k = f(x_k^*)$. If $UPBD_k < UPBD$, then set $UPBD = UPBD_k$. If $UPBD_k > UPBD$, delete $X_k$ and go to STEP 4. Next, check the box using the monotonicity test (ignoring constraints), the nonconvexity test (ignoring constraints) and the local feasibility tests. If the tests fail, delete $X_k$ and go to STEP 4.

STEP 6: Check if the maximum width of $X_k$ is less than the specified tolerance. If yes, insert the box in the good_list and go to STEP 4.

STEP 7: Apply the back boxing technique (the process of identifying a box that surrounds a given point such that the objective function is convex on that box; Iwaarden, 1996) to box the maximum convex region around the upper bound solution point $x_k^*$. Insert the solution in the good_list and partition the remaining search space of $X_k$. If back boxing is not applicable, partition the box about its maximum width as described earlier. Insert the partitioned boxes into the box_list. Here the boxes are inserted in an ordered list with the box with the greatest maximum width on top. Note that back boxing is defined only for an unconstrained problem. For a constrained optimization problem we proceed in one of two ways: (a) apply back boxing to the objective function plus the constraints, which have been weighted with a penalty parameter, or (b) apply back boxing to the objective function and check if the constraints are satisfied in the resulting box. Go to STEP 4.

Figure 4: Flowchart of the Global Optimization Algorithm

STEP 8: Rank order the boxes on the good_list based on the objective function values. Delete boxes for which the lower bound $F^L$ is greater than UPBD. Terminate and return the list of good boxes containing the globally optimal solution or set of solutions.
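The overall flow of STEPS 1 to 8 can be illustrated with a deliberately simplified, one-dimensional sketch. The assumptions are significant and worth stating: the test function $f(x) = (x^2 - 2)^2$ is hypothetical, a midpoint evaluation stands in for the SQP local solve, the monotonicity, nonconvexity and back boxing steps are omitted, and ordinary floating point is used instead of outward-rounded interval arithmetic. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def width(self) -> float:
        return self.hi - self.lo

    def mid(self) -> float:
        return 0.5 * (self.lo + self.hi)

    def sqr(self) -> "Interval":
        """Exact range of x**2 over the interval."""
        a, b = self.lo * self.lo, self.hi * self.hi
        if self.lo <= 0.0 <= self.hi:
            return Interval(0.0, max(a, b))
        return Interval(min(a, b), max(a, b))

    def shift(self, c: float) -> "Interval":
        return Interval(self.lo + c, self.hi + c)


def f_point(x: float) -> float:
    return (x * x - 2.0) ** 2


def f_interval(x: Interval) -> Interval:
    """Interval extension of f; encloses the range of f over x."""
    return x.sqr().shift(-2.0).sqr()


def branch_and_bound(x0: Interval, eps: float = 1e-6):
    box_list = [x0]            # STEP 1: search space
    good_list = []             # STEP 1: candidate solution boxes
    upbd = float("inf")        # STEP 3: no local solution yet
    while box_list:            # STEP 4
        box_list.sort(key=Interval.width)
        box = box_list.pop()   # STEP 5: widest box first
        upbd = min(upbd, f_point(box.mid()))  # stand-in for the SQP solve
        if f_interval(box).lo > upbd:
            continue           # upper bound test: delete the box
        if box.width() < eps:  # STEP 6
            good_list.append(box)
            continue
        m = box.mid()          # STEP 7 without back boxing: bisect
        box_list.append(Interval(box.lo, m))
        box_list.append(Interval(m, box.hi))
    # STEP 8: discard boxes whose lower bound exceeds the final UPBD
    good_list = [b for b in good_list if f_interval(b).lo <= upbd]
    return upbd, good_list


upbd, good_boxes = branch_and_bound(Interval(-3.0, 3.0))
```

Because the test function has two global minimizers (at plus and minus the square root of 2), the returned list contains small boxes around both of them, which mirrors the property that the algorithm returns all globally optimal points.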

The attractive features of this algorithm are summarized as:

1. Analytical expressions for gradients and Hessians of the objective function and constraints do not have to be supplied by the user. They are not computed by finite differences; rather, their analytical forms are automatically constructed by the automatic differentiation package.

2. If a problem has more than one global solution, then the algorithm finds all the globally optimal solution points.

3. It can be used for design under uncertain parameters. This is a distinctive feature of interval-based techniques. In effect, if a parameter is not exactly known, but its nominal value and a corresponding confidence interval or error band are known, then it can be used directly as an interval parameter. Since parameter correlations are not accounted for in the hyperbox defined by the parameter intervals, parametric uncertainty addressed this way will result in rather conservative solutions.
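The third feature can be made concrete with a small interval arithmetic sketch. The numbers are hypothetical (a vapor pressure of 3.2 mmHg with an assumed error band of 0.2 mmHg, and a fixed mole fraction of 0.45), and the helper functions are illustrative rather than taken from VerGO or LIBRA. The last lines show why ignoring correlations is conservative: subtracting an interval from itself does not give zero.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower, upper)


def add(a: Interval, b: Interval) -> Interval:
    return (a[0] + b[0], a[1] + b[1])


def sub(a: Interval, b: Interval) -> Interval:
    return (a[0] - b[1], a[1] - b[0])


def mul(a: Interval, b: Interval) -> Interval:
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


# Hypothetical uncertain parameter: nominal 3.2 mmHg, error band 0.2 mmHg.
p_sat: Interval = (3.0, 3.4)
x: Interval = (0.45, 0.45)  # an exactly known mole fraction as a point interval

# Enclosure of x * p_sat: roughly 1.35 to 1.53 mmHg.
partial_pressure = mul(x, p_sat)

# The same parameter used twice is treated as two independent intervals,
# so p_sat - p_sat is enclosed by roughly (-0.4, 0.4) rather than by 0.
spurious = sub(p_sat, p_sat)
```

The enclosure is always valid, but it can be wider than the true range whenever the same uncertain quantity (or correlated quantities) appears more than once in an expression.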

This algorithm has been tested for many benchmark NLPs and MINLP problems. We have shown that for MINLP problems the splitting strategy does not result in a total enumeration of all possible solutions.


Nearly all conventional blanket washes contain VOCs (Adrian, 1991). In Connecticut alone about 13.5 tons/year of spent solvent is disposed of by the printing industry (Lomasney, 1994). Of this, only 0.69 tons are aqueous solutions, while 10.57 tons are non-halogenated solvents and 2.26 tons are halogenated solvents. Weltman and Evanoff discussed the benefits of replacing halogenated degreasers by aqueous substitutes (Weltman and Evanoff, 1991). The Printing Industry of America (PIA) and the USEPA started a major initiative in the early 1990s to search for alternative water-based blanket wash solvents (United States Environmental Protection Agency, 1995).


Fairly recently the Toxics Reduction Institute (Toxics Reduction Institute, 1997) developed a water-based solvent, Printwise™, that can be used as a blanket wash. The AG Environmental Products company recently commercialized a water-based blanket wash product, "Soygold®". This solvent is a methyl ester of soybean oil and is completely miscible in water. All these water-based blanket washes have very low VOC content and low environmental impact, are less toxic and easy to recover, and are amenable to safe disposal.

The use of surfactants and dispersants gives solvents better cleaning characteristics and has become increasingly common in blanket wash formulations (Design for the Environment Program, 1997). Many blanket wash formulations use alkyl benzene sulfonates (ABS) as surfactants. These are water-soluble surfactants and cannot be used in hydrocarbon-based blanket washes. Thus there is a strong incentive for designing water-based blanket wash solvent blends. This case study explores the systematic development of aqueous blends for use as blanket wash solvents. Details of the problem formulation and solution are given in the appendix (section 11.8).


Computer aided blend design is a highly complicated problem. The general understanding of variations in blend properties with variations in composition has improved substantially over the last few decades and many companies have developed computer models to predict blend properties. These models are being used successfully by companies such as DuPont (Wu, 1987). However, using such models directly for discovering optimal blends is non-trivial. Moreover, identification of a binary solvent blend from a set of single component solvents involves a combinatorial search for an optimal pair. To address this we have developed an optimization framework for the mathematical representation of the solvent blend design problem. The mathematical framework is an MINLP problem. To solve such design problems we have also developed an interval based global optimization tool, LIBRA.

This framework and solution approach have been used to solve an industrially relevant problem of designing optimal blends for blanket wash applications in the printing industry, taking into account solvent power, viscosity and surface tension. Seven binary mixture design problems were solved in the case study, and for each of them we have been able to identify a globally optimal blend composition for the solvent mixture.



1. Adrian, J. R., Managing Solvents and Wipers. Pollution Prevention Review, 419-425 (1991).

2. Barton, A. F. CRC Handbook of Solubility Parameters and Other Cohesion Parameters, CRC Press, Inc., Boca Raton, Florida (1985).

3. Bazaraa, M. S., and Shetty, C. M. Nonlinear Programming, Wiley, New York (1979).

4. Biegler, L. T., Grossmann, I. E., and Westerberg, A. W. Systematic Methods of Chemical Process Design, Prentice Hall, New Jersey (1997).

5. Biegler, L. T. and Cuthrell, J. E., "Improved Infeasible Path Optimization for Sequential Modular Simulators - II: The Optimization Algorithm", Comp. & Chem. Eng., 9(3), (1985)

6. Center for Aerosol Technology, "Solvent Alternative Guide (SAGE) Version 1.0 Technical report," Research Triangle Institute, Research Triangle Park, N.C. (1993).

7. Design for the Environment Program, "Cleaner Technologies Substitute Assessment: Lithographic Blanket Washes." EPA 744-R-97-006, USEPA (1997).

8. Dunn, R. F., Dobson, A. C., and El-Halwagi, M. M., Optimal Design of Environmentally Acceptable Solvent Blends for Coating. Advances in Environmental Research, 1(2) (1997).

9. Duvedi, A. P., "Mathematical Programming Based Approaches to the Design of Environmentally Safe Refrigerants," Masters Thesis, University of Connecticut (1995).

10. Duvedi, A. P., and Achenie, L. E. K., Designing Environmentally Safe Refrigerants Using Mathematical Programming. Chemical Engineering Science, 51, 3727-3739 (1996).

11. Duvedi, A. P., and Achenie, L. E. K., On the Design of Environmentally Benign Refrigerant Mixtures: a Mathematical Programming Approach. Computers and Chemical Engineering, 21(8), 915-923 (1997).

12. Gani, R., and Brignole, E. A., Molecular Design of Solvents for Liquid Extraction Based on UNIFAC. Fluid Phase Equilibria, 13, 331 (1983).

13. Griewank, A., Juedes, D., and Utke, J., Algorithm 755: ADOL-C: A Package for the Automatic Differentiation of Algorithms Written in C/C++. ACM Transactions on Mathematical Software, 22(2), 131-167 (1996).

14. Hansen, E. Global Optimization Using Interval Analysis, Marcel Dekker, Inc (1992).

15. Horvath, A. L. Molecular Design, Elsevier (1992).

16. Ichida, K., and Fujii, Y., An interval arithmetic method to global optimization. Computing, 23, 85 (1979).

17. Iwaarden, R. J. V., "An Improved Unconstrained Global Optimization Algorithm," University of Colorado at Denver, Denver (1996).


18. Joback, K. G., and Stephanopoulos, G. "Designing Molecules Possessing Desired Physical Property Values." Proceedings FOCAPD 1989, Snowmass Village, Colorado (1989).

19. Klein, J. A., Wu, D. T., and Gani, R., Computer Aided Mixture Design with Specified Property Constraints. Computers and Chemical Engineering, 16 Supplement (S229) (1992b).

20. Krishner, E. M., Environment, Health Concerns Force Shift In Use Of Organic Solvents. Chemical and Engineering News (June 20), 13-20 (1995).

21. Lamasney, R., Printers, Connecticut Technical Assistance Program (1994).

22. Macchietto, S., Odele, O., and Omatsone, O., Design of Optimal Solvents for Liquid-Liquid Extraction and Gas Absorption Processes. Transactions of the Institute of Chemical Engineers, 68, 429 (1990).

23. Moore, R., Hansen, E. R., and Leclerc, A., Rigorous Methods for Global Optimization. Recent Advances in Global Optimization, C. A. Floudas and M. Pardalos, eds., Princeton University Press (1992).

24. Moore, R. E. Interval Analysis, Prentice-Hall, Englewood Cliffs, New Jersey (1966).

25. Moore, R. E. "Methods and applications of interval analysis." SIAM, Philadelphia (1979).

26. Naser, S. F., and Fournier, R. L., A System for the Design of an Optimum Liquid-Liquid Extractant Molecule. Comput. Chem. Eng., 15(6), 397 (1991).

27. Neumaier, A. Interval Methods for Systems of Equations, Cambridge University Press, London (1990).

28. Odele, O., and Macchietto, S., Computer Aided Molecular Design: A Novel Method for Optimal Solvent Selection. Fluid Phase Equilibria, 82, 47-54 (1993).

29. Pistikopoulos, E. N., and Stefanis, S. K., Optimal solvent design for environmental impact minimization. Computers and Chemical Engineering, 22(6), 717-733 (1998).

30. Pretel, E. J., Lopez, P. A., Bottini, S. B., and Brignole, E. A., Computer-Aided Molecular Design of Solvents for Separation Processes. AIChE Journal, 40(8), 1349-1360 (1994).

31. Ratschek, H., and Rokne, J., Interval Tools for Global Optimization. Computers Math. Applic., 21(6/7), 41-50 (1991).

32. Ratschek, H., and Rokne, J. Computer Methods for the Range of Functions, Halsted Press, New York (1984).

33. Reid, R. C., Prausnitz, J. M., and Poling, B. E. The Properties of Gases and Liquids, McGraw Hill, New York (1987).

34. Rokne, J. H., Low Complexity k-dimensional centered forms. Computing, 37, 247-253 (1986).

35. Tamura, M., Kurata, M., and Odani, H., Bull. Chem. Soc. Japan, 28, 83 (1955).


36. Toxics Reduction Institute, "Demonstration of Printwise™: A "Near Zero" Lithographic Ink and Blanket Wash System." 39, University of Massachusetts, Lowell (1997).

37. United States Environmental Protection Agency, "Cleaner Technologies Substitutes Assessment: Lithographic Blanket Washes." EPA 744-R-97-006, United States Environmental Protection Agency (1997b).

38. Vaidyanathan, R., and El-Halwagi, M., Computer-Aided Design of High Performance Polymers. J. Elastom Plasti., 26(3), 277 (1994a).

39. Vaidyanathan, R., and El-Halwagi, M., Global Optimization of Nonconvex Nonlinear Programs Via Interval Analysis. Computers and Chemical Engineering, 18(10), 889-897 (1994b).

40. Vaidyanathan, R., and El-Halwagi, M., Computer-Aided Synthesis of Polymers and Blends with Target Properties. Ind. Eng. Chem. Res., 35(2), 627-634 (1996).

41. Vaidyanathan, R., Gowayed, Y., and El-Halwagi, M., Computer-aided design of fiber reinforced polymer composite products. Computers and Chemical Engineering, 22(6), 801-808 (1998).

42. Van Iwaarden, R. J., "An Improved Unconstrained Global Optimization Algorithm," Ph.D., University of Colorado, Denver (1996).

43. Venkatasubramanian, V., Chan, K., and Carruthers, J. M., Evolutionary design of molecules with desired properties using the genetic algorithms. J. Chem. Inf. Comput. Sci, 35, 188 (1995).

44. Weltman, H. J., and Evanoff, S. P., "Replacement of Halogenated Solvent Degreasing with Regenerable Aqueous Cleaners," General Dynamics Corporation, Fort Worth, Texas (1991).

45. Wu, D. T., Modeling and simulation in the coating industry. Chemtech (January 1987).

46. Zhao, R., and Cabezas, H., Molecular Thermodynamics in the Design of Substitute Solvents. Industrial and Engineering Chemistry Research, 37, 3268-3280 (1998).

11.8 APPENDIX: DETAILED SOLUTION OF CASE STUDY

11.8.1 Case Study Objective

Even though many water-soluble solvents exist that can be used to make a blanket wash formulation, deciding on the composition of the final wash formulation is a trial and error procedure. Moreover, mixture property prediction for aqueous systems is difficult and the models are highly nonlinear (Reid et al., 1987; Wu, 1987). Thus there is a strong incentive for developing a formulation tool that can design aqueous blanket wash blends in the presence of nonlinear (and possibly non-convex) models. In this study, we employ our interval arithmetic based global optimization package LIBRA for the systematic design of optimal water-based blanket wash systems.

11.8.2 Basis Set

The EPA report on blanket wash risk assessment (Design for the Environment Program, 1997) lists 40 different formulations (or solvent blends) used as blanket washes by different printing facilities throughout the United States. However, for proprietary reasons their compositions are not reported. Of these, 21 formulations contain petroleum distillates (hydrocarbons and/or aromatic hydrocarbons), which pose considerable environmental, health and safety risks. Two common aromatic hydrocarbons used in blanket washes are 1-2-4 trimethyl benzene (C9H12) and isomers of xylene (C8H10). Trimethyl benzene has a flash point of 54.4 °C and a log Kow of 3.78. Isomers of xylene have flash points as low as 17 °C and a log Kow of 3.15. Thus both are flammable and have high bioaccumulation and toxicity; their properties are shown in Table 1 below.

Table 1: Two aromatic hydrocarbons used in many commercial blanket washes.

| Property | 1-2-3 Trimethyl Benzene | Xylene (o-, m- and p-xylene) |
|---|---|---|
| OSHA PEL | 200 mg/m3 | - |
| Log Kow | 3.78 | 3.15 |
| Log BCF | 2.53 | 2.16 |
| Log Koc | 2.86 | -0.69 |
| Water Solubility (g/L) | 0.02 | 0.1 |

The pure component solvents employed in this case study are non-halogenated and non-aromatic water-soluble compounds. Only solvents that have a relatively small environmental and health impact are selected. These solvents are listed in Table 2 and the desired attributes for an optimal blanket wash formulation are defined in Table 3.

Sources: EPA Report on Cleaner Technologies Substitutes Assessment: Lithographic Blanket Washes (United States Environmental Protection Agency, 1997), and SOLV-DB, Solvent Database at "http://solvdb.ncms.org/solvdb.htm", National Center of Manufacturing Science.


These attributes target the solvent power, its flow characteristics, surface contacting and environmental impact. Note that by constraining both density and viscosity, we have constrained the kinematic viscosity ($\mu/\rho$) of the blends. The pure component properties of the basis set are presented in Table 4.

Table 2: Solvents used in the case study to design blends.

| Solvent | OSHA PEL (mg/m3) | Log Kow | Log BCF | Log Koc | Water Solubility (mg/kg) |
|---|---|---|---|---|---|
| Methyl Ethyl Ketone (MEK) | 200 | 0.29 | 0.0 | 0.72 | 223 000 |
| γ-Butyrolactone (GBL) | - | -0.64 | -0.7 | 0.85 | ∞ |
| N-Methyl Pyrrolidone (NMP) | - | -0.11 | -0.31 | 1.32 | ∞ |
| Propylene Glycol (PG) | - | -0.92 | -0.82 | 0.0 | ∞ |
| α-Terpineol | - | 3.33 | 2.30 | 1.76 | ∞ |
| Diethylene Glycol Monomethyl Ether (DGME) | 100 | -1.18 | -1.1 | -1.3 | ∞ |
| Diethylene Glycol Monoethyl Ether (DGEE) | 100 | -1.18 | -1.1 | -1.3 | ∞ |
| Water | | | | | |


Table 3: Desired attributes of an optimal blanket wash blend.

| Attribute | Target |
|---|---|
| Solvent Power: based on the solubility interaction radius of blend and polymeric resin | $R^{ij} < R^*$ |
| Density ($\rho$ as specific gravity) | [0.9 - 1.4] |
| Viscosity ($\mu$ in centipoise) | [0.8 - 1.4] |
| Surface Tension ($\sigma$ in dyn/cm) | [30.0 - 45.0] |
| Vapor Pressure ($P^{sat}$ in mmHg) | [0 to 2] |
| Inhalation Exposure (IE in mg/day) | [0 to 2] |
| Permissible Exposure Limit (PEL in mg/m3) | [0 to 100] |

The resin is a phenolic resin, Phenodur® 373 U (Barton, 1985). Its solubility parameters are $\delta_D = 19.7$, $\delta_P = 11.6$ and $\delta_H = 14.6$, and its interaction radius is $R^* = 12.7$ (all in MPa$^{1/2}$).

Table 4: Pure component properties of solvents in the basis set.

| Component | $\mu$ (cp) | $\sigma$ (dyn/cm) | $P^{sat}$ (mmHg) | $\delta_D$ (MPa$^{1/2}$) | $\delta_P$ (MPa$^{1/2}$) | $\delta_H$ (MPa$^{1/2}$) | $\rho$ (sg) |
|---|---|---|---|---|---|---|---|
| Methyl Ethyl Ketone | 0.378 | 24.600 | 95.300 | 14.100 | 9.300 | 9.500 | 0.801 |
| Butyrolactone | 1.700 | 40.430 | 3.200 | 18.600 | 12.200 | 14.000 | 1.120 |
| NMP | 1.660 | 40.700 | 0.334 | 16.500 | 10.400 | 13.500 | 1.001 |
| Propylene Glycol | 19.000 | 36.510 | 0.200 | 11.800 | 13.300 | 25.000 | 1.034 |
| DGME | 3.480 | 28.190 | 0.180 | 16.200 | 9.200 | 14.300 | 1.229 |
| DGEE | 3.850 | 29.530 | 0.126 | 16.200 | 9.200 | 12.300 | 1.025 |
| α-terpineol | 36.500 | 31.600 | 0.490 | 13.900 | 7.900 | 10.200 | 0.819 |
| Water | 1.000 | 70.000 | 50.000 | 26.500 | 23.300 | 14.800 | 1.000 |

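The mathematical formulation for the problem is

PBLEND: minimize $R^{ij} = \left[ 4(\delta_D - \delta_D^*)^2 + (\delta_P - \delta_P^*)^2 + (\delta_H - \delta_H^*)^2 \right]^{1/2}$

subject to:

$\delta_D = \sum_i \phi_i \delta_{Di}$

$\delta_P = \sum_i \phi_i \delta_{Pi}$

$\delta_H = \sum_i \phi_i \delta_{Hi}$

$\phi_i = \dfrac{x_i V_i}{\sum_j x_j V_j}$ (Solvent Power)

$\rho^L \le \rho_{mix} \le \rho^U$, with $\rho_{mix}$ evaluated from $\sum_i x_i \cdot MW_i$ (Density Constraint)

$\eta^L \le \eta_{mix} \le \eta^U$

$\ln(\eta_{mix}\,\varepsilon_{mix}) = x_1 \ln(\eta_1 \varepsilon_1) + x_2 \ln(\eta_2 \varepsilon_2)$

$\varepsilon = \dfrac{V_c^{2/3}}{(T_c M)^{1/2}}$

$V_c(mix) = \sum_i \sum_j x_i x_j V_{c(ij)}$

$M(mix) = \sum_i x_i M_i$

$V_{c(ij)} = \dfrac{\left(V_{ci}^{1/3} + V_{cj}^{1/3}\right)^3}{8}$ (Viscosity Constraint)

$\sigma^L \le \sigma_{mix} \le \sigma^U$

$\sigma_{mix}^{1/4} = \psi_w \sigma_w^{1/4} + \psi_o \sigma_o^{1/4}$

$\psi_o = 1 - \psi_w$

$\log_{10} \dfrac{\psi_w^q}{1 - \psi_w} = \log_{10}\left[ \dfrac{(x_w V_w)^q}{x_o V_o} (x_w V_w + x_o V_o)^{1-q} \right] + 0.441\, \dfrac{q}{T} \left( \dfrac{\sigma_o V_o^{2/3}}{q} - \sigma_w V_w^{2/3} \right)$ (Surface Tension Constraint)

$\sum_i x_i P_i^{sat} \le P^{sat,max}$ (Vapor Pressure Constraint)

$118.8\, x_i\, p_i \le PEL_i$ (OSHA Constraint on Permissible Exposure Limit)

$(0.48\, \Delta t)\, G_i \le IE^{max}$ (Inhalation Exposure Limit)

$x_w > 0.3$ (Water fraction more than 30%)

$\sum_i x_i = 1$ (mole fraction constraint)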

More details of these models are given in Section 11.8.4. An MINLP model was formulated and solved in two ways. In Case 1, the model (PBLEND) was solved by fixing the binary variables, resulting in an NLP model. Specifically, each binary mixture was constructed by fixing the binary variables for water and one of the other pure component solvents to 1; the remaining binary variables were set to 0. The surface tension model requires the calculation of the water fraction $\psi_w$, which becomes an additional search variable.

In Case 2 (PBLEND_MINLP) the MINLP model was solved rigorously, i.e. without fixing the binary variables. This is a relatively more difficult problem with 17 variables and 8 binary variables. Also in this case we only considered 2-component blends (i.e. binary blends, not to be confused with binary variables). Thus the solution approach not only picks the best combination of 2-component solvents (combinatorial problem) but also finds the optimal composition (continuous problem).

11.8.3 Results and Discussion

Table 5: Computational results of blend design case study - Case 1.

| Component | $R^{ij}$ | $x_1$ | $x_w$ | $\psi_w$ | Iter. | CPU (secs) |
|---|---|---|---|---|---|---|
| MEK-Water | 10.34 | 0.14 | 0.86 | 0.31 | 31 | 3.4 |
| Butyrolactone-Water | 4.40 | 0.45 | 0.55 | 0.10 | 17 | 2.5 |
| NMP-Water | 6.54 | 0.46 | 0.54 | 0.07 | 33 | 4.1 |
| PropyleneGlycol-Water | 12.94 | 0.06 | 0.94 | 0.47 | 29 | 5.9 |
| DGME-Water | 10.02 | 0.07 | 0.93 | 0.25 | 23 | 13.3 |
| DGEE-Water | 8.58 | 0.10 | 0.90 | 0.18 | 25 | 8.7 |
| α-terpineol-Water | 10.16 | 0.08 | 0.924 | 0.16 | 21 | 5.1 |
| Any component-Water | 4.40 | 0.45 (GBL) | 0.55 (Water) | 0.10 | 1013 | 242.0 |

The models PBLEND and PBLEND_MINLP were solved to obtain 7 different binary mixtures, as shown in Table 5. From Table 5, consider for example propylene glycol and water blends, for which the globally minimal objective function, $R^{ij}$, is 12.94. Since this value exceeds the interaction radius (12.7) of the phenolic resin, no propylene glycol and water blend falls within the interaction radius. Therefore, it is expected that no propylene glycol and water blend will be effective in dissolving phenolic resin. Among all solutions, the lowest objective value is achieved by a γ-butyrolactone and water blend with an interaction radius of 4.4. The attributes of the solvent blends in Table 5 are tabulated in Tables 6a-6b.

Table 6a: Blend Properties.

| Component | $\rho$ (sg) | $P^{sat}$ (mmHg) | IE# (mg/day) | PEL (mg/m3) |
|---|---|---|---|---|
| MEK-Water | 0.910 | 13.628 | 116.825 | 15.348 |
| Butyrolactone-Water | 1.093 | 1.440 | 15.235 | 58.045 |
| NMP-Water | 1.001 | 0.154 | 1.828 | 54.339 |
| PropyleneGlycol-Water | 1.007 | 0.012 | 0.115 | 7.130 |
| DGME-Water | 1.067 | 0.013 | 0.125 | 8.917 |
| DGEE-Water | 1.011 | 0.013 | 0.195 | 12.173 |
| α-terpineol-Water | 0.922 | 0.037 | 0.563 | 8.237 |

# Inhalation Exposure

Table 6b: Blend Properties.

| Component | $\delta_D$ (MPa$^{1/2}$) | $\delta_P$ (MPa$^{1/2}$) | $\delta_H$ (MPa$^{1/2}$) | $\sigma_{mix}$ (dyn/cm) | $\mu_{mix}$ (cp) |
|---|---|---|---|---|---|
| MEK-Water | 20.860 | 16.932 | 12.389 | 35.000 | 0.830 |
| Butyrolactone-Water | 20.359 | 14.672 | 14.178 | 42.862 | 1.363 |
| NMP-Water | 18.259 | 12.669 | 13.729 | 42.380 | 1.322 |
| PropyleneGlycol-Water | 23.457 | 21.230 | 16.911 | 38.350 | 1.218 |
| DGME-Water | 23.484 | 19.171 | 14.654 | 30.260 | 1.181 |
| DGEE-Water | 21.841 | 16.922 | 13.669 | 31.572 | 1.236 |
| α-terpineol-Water | 21.075 | 16.669 | 12.819 | 33.592 | 1.361 |

Water (with pure $\sigma = 70$) is the major component in all the blends shown in Tables 6a-6b. We note that the surface tension is highly nonlinear in that a small organic fraction in the aqueous blends results in a very large change in surface tension. For example, 6% of propylene glycol (with $\sigma = 36.51$) reduced the aqueous blend's surface tension from 70 to 38.35. This behavior is also true in practice, as verified by many experimental results (Tamura et al., 1955).

Case 2 (PBLEND_MINLP) is solved next. The solution found is identical to the second blend of Case 1 (an aqueous blend with γ-butyrolactone at a mole fraction of 0.450). However, for this problem the number of iterations (1013) is much higher. Consequently the CPU time is relatively large (242.3 seconds). By comparison, the seven NLPs of Case 1 required 179 iterations and about 43 CPU seconds in total (Table 5). Thus it appears that when the number of alternatives is small, the MINLP formulation is less efficient than a complete enumeration.

11.8.4 Mixture Property Models

The mixture property models used in the case study are outlined here.

Solvent Power

For a solvent mixture, each component of the solubility parameter can also be computed by

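$\delta_D = \sum_i \phi_i \delta_{Di}$

$\delta_P = \sum_i \phi_i \delta_{Pi}$

$\delta_H = \sum_i \phi_i \delta_{Hi}$

where $\phi_i$ is the volume fraction of each component, expressed as

$\phi_i = \dfrac{x_i V_i}{\sum_j x_j V_j}$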

Phenolic resins are commonly used in printing inks. The dried ink (solute) is assumed to be a phenolic resin, specifically "Super Beckacite® 1001, Reichhold".

The solubility parameters of the resin are nonpolar ($\delta_D$) = 23.3, polar ($\delta_P$) = 6.6 and hydrogen bonding ($\delta_H$) = 8.3 MPa$^{1/2}$ (Barton, 1985). The radius of interaction is $R^*$ = 19.8 MPa$^{1/2}$. Thus, solvents that can effectively dissolve the ink residue satisfy the following solute-solvent interaction constraint:

$R^{ij} = \left( 4(\delta_D - 23.3)^2 + (\delta_P - 6.6)^2 + (\delta_H - 8.3)^2 \right)^{1/2} < 19.8$
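The solvent power calculation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the solubility parameters are taken from Table 4, and the molar volumes (76.9 cm3/mol for butyrolactone and 18.0 cm3/mol for water) are assumed values that do not appear in the text.

```python
import math
from typing import Sequence, Tuple


def volume_fractions(x: Sequence[float], v: Sequence[float]) -> list:
    """phi_i = x_i V_i / sum_j x_j V_j."""
    total = sum(xi * vi for xi, vi in zip(x, v))
    return [xi * vi / total for xi, vi in zip(x, v)]


def blend_parameter(phi: Sequence[float], delta: Sequence[float]) -> float:
    """Volume-fraction average of one solubility parameter component."""
    return sum(p * d for p, d in zip(phi, delta))


def interaction_radius(d_d: float, d_p: float, d_h: float,
                       resin: Tuple[float, float, float]) -> float:
    """R = [4 (dD - dD*)^2 + (dP - dP*)^2 + (dH - dH*)^2]^(1/2)."""
    r_d, r_p, r_h = resin
    return math.sqrt(4.0 * (d_d - r_d) ** 2 + (d_p - r_p) ** 2 + (d_h - r_h) ** 2)


# Butyrolactone (index 0) and water (index 1); parameters from Table 4.
x = [0.45, 0.55]
molar_volume = [76.9, 18.0]  # cm3/mol, ASSUMED values for illustration
delta_d = [18.6, 26.5]
delta_p = [12.2, 23.3]
delta_h = [14.0, 14.8]

phi = volume_fractions(x, molar_volume)
d_d = blend_parameter(phi, delta_d)
d_p = blend_parameter(phi, delta_p)
d_h = blend_parameter(phi, delta_h)

# Resin parameters and radius quoted in this section.
radius = interaction_radius(d_d, d_p, d_h, resin=(23.3, 6.6, 8.3))
dissolves = radius < 19.8
```

With these assumed molar volumes the blended parameters come out close to the butyrolactone-water entries of Table 6b. The radius computed here uses the resin parameters quoted in this section, which differ from those listed in Table 3, so it should not be compared directly with the objective values of Table 5.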

Flow Characteristic (viscosity)

Most liquid mixture viscosity models assume that the pure component models are available. Reid et al. discuss two different models for mixture viscosity, namely (a) the Grunberg and Nissan model and (b) the method of Teja and Rice. While the first model works well for organic liquids, the Teja and Rice model is specially recommended for aqueous blends. The equation for binary mixture viscosity is:

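$\ln(\eta_{mix}\,\varepsilon_{mix}) = x_1 \ln(\eta_1 \varepsilon_1) + x_2 \ln(\eta_2 \varepsilon_2)$

$\varepsilon = \dfrac{V_c^{2/3}}{(T_c M)^{1/2}}$

$V_c(mix) = \sum_i \sum_j x_i x_j V_{c(ij)}$

$M(mix) = \sum_i x_i M_i$

$V_{c(ij)} = \dfrac{\left(V_{ci}^{1/3} + V_{cj}^{1/3}\right)^3}{8}$

where $\eta_i$ is the pure component viscosity evaluated at $T\,(T_{ci}/T_{cm})$, $V_c$ and $T_c$ are the critical volume and temperature, and $M$ is the molecular weight. Flow characteristic requirements are posed as

$\eta^L \le \eta(product) \le \eta^U$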

Surface Contacting

Surface contacting determines how effective the solvent will be in wetting the blanket surface; thus it characterizes the solvent's cleaning ability. High surface tension also translates to more energy utilization, especially if the cleaning is performed via a wiping operation. The surface tension of aqueous solutions is more difficult to predict than that of non-aqueous mixtures because of its nonlinear dependence on mole fraction. A small concentration of organic material may significantly affect the surface tension value. For many binary organic-aqueous mixtures, the method of Tamura, Kurata, and Odani (Tamura et al., 1955) is recommended (Reid et al., 1987):

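$\sigma_{mix}^{1/4} = \psi_w \sigma_w^{1/4} + \psi_o \sigma_o^{1/4}$

where

$\sigma_{mix}$ = surface tension of mixture, dyn/cm
$\sigma_w$ = surface tension of pure water, dyn/cm
$\sigma_o$ = surface tension of pure organic compound, dyn/cm
$\psi_o = 1 - \psi_w$

and $\psi_w$ is defined by the relation

$\log_{10} \dfrac{\psi_w^q}{1 - \psi_w} = \log_{10}\left[ \dfrac{(x_w V_w)^q}{x_o V_o} (x_w V_w + x_o V_o)^{1-q} \right] + 0.441\, \dfrac{q}{T} \left( \dfrac{\sigma_o V_o^{2/3}}{q} - \sigma_w V_w^{2/3} \right)$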

Here

$x_w$ = bulk mole fraction of pure water
$x_o$ = bulk mole fraction of pure organic component
$V_w$ = molar volume of pure water, cm3/mol
$V_o$ = molar volume of pure organic component, cm3/mol
$q$ = constant to be read from a table; it depends on the type and size of the organic component. For example, $q = n_c$ for fatty acids and alcohols, and $q = n_c - 1$ for ketones. Here $n_c$ is the number of carbon atoms in the molecule.

Expected errors are reported to be less than 10% for q less than 5 and less than 20% for q greater than 5.
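Because the relation for $\psi_w$ is implicit, it has to be solved numerically. The sketch below uses plain bisection, which works because the left-hand side increases monotonically with $\psi_w$. The function names are hypothetical, and the inputs in the example (q = 3, an organic molar volume of 73.6 cm3/mol and T = 298 K) are assumptions chosen only for illustration; the result is not expected to reproduce the tabulated case study values.

```python
import math


def tamura_rhs(x_w: float, v_w: float, v_o: float,
               sigma_w: float, sigma_o: float, q: float, temp: float) -> float:
    """Right-hand side of the implicit relation for psi_w."""
    x_o = 1.0 - x_w
    bulk = (x_w * v_w) ** q / (x_o * v_o) * (x_w * v_w + x_o * v_o) ** (1.0 - q)
    surface = 0.441 * (q / temp) * (
        sigma_o * v_o ** (2.0 / 3.0) / q - sigma_w * v_w ** (2.0 / 3.0)
    )
    return math.log10(bulk) + surface


def psi_water(c: float, q: float) -> float:
    """Solve log10(psi^q / (1 - psi)) = c for psi in (0, 1) by bisection."""
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if q * math.log10(mid) - math.log10(1.0 - mid) < c:
            lo = mid   # left-hand side too small: psi must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sigma_mix(x_w: float, v_w: float, v_o: float,
              sigma_w: float, sigma_o: float, q: float, temp: float) -> float:
    """Mixture surface tension from the fourth-root mixing rule."""
    c = tamura_rhs(x_w, v_w, v_o, sigma_w, sigma_o, q, temp)
    psi = psi_water(c, q)
    return (psi * sigma_w ** 0.25 + (1.0 - psi) * sigma_o ** 0.25) ** 4


# Illustrative aqueous blend; q, v_o and temp are ASSUMED values.
args = dict(x_w=0.94, v_w=18.0, v_o=73.6, sigma_w=70.0, sigma_o=36.51,
            q=3.0, temp=298.0)
c_value = tamura_rhs(**args)
psi = psi_water(c_value, args["q"])
sigma = sigma_mix(**args)
```

Even with 94 mol% water in the bulk, the computed $\psi_w$ is well below the bulk water fraction, which is the mechanism behind the strong drop in surface tension discussed for aqueous blends.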


Chapter 12: Refrigerant Design Case Study

A. Apostolakou & C. S. Adjiman

12.1 INTRODUCTION

The CAMD problem formulation presented in Chapter 4 is applied to the problem of refrigerant design. In the early days of the development of refrigerants, chlorofluorocarbons emerged as the most likely candidates for effective refrigerants. However, following their widespread adoption, the impact of fully halogenated chlorofluorocarbons (CFCs) on stratospheric and tropospheric ozone, global warming and acid deposition has been detected. As a result of the Montreal Protocol, these compounds have been phased out, stimulating a search for alternatives.

Thermodynamic and transport properties can be used to evaluate the performance of a refrigerant system. The most important thermodynamic properties are the boiling point, vapor pressure, liquid specific heat, and enthalpy of vaporization. Joback and Stephanopoulos (1989) were the first to report the computer-aided molecular design of replacement refrigerants, followed by Gani et al. (1991). Environmental constraints in the form of ozone depletion potential have since been explicitly considered by Duvedi and Achenie (1996) and Churi and Achenie (1996). Sahinidis and Tawarmalani (2000) have applied a global optimization algorithm to the problem. In this example the refrigerant design problem of finding alternatives to freon-12 (CCl2F2), first proposed by Joback and Stephanopoulos (1989), is considered. The objective is to find a refrigerant as good as or better than freon-12. This is defined as a refrigerant with a larger heat of vaporization and with a smaller liquid heat capacity than that of freon-12. The replacement refrigerant obtained should be analyzed with respect to environmental properties.

One important consideration in material replacement is the desire to use existing equipment and processing technology. This requires that both the old and the substitute material should have similar transport and other properties such as heat capacity and vapor pressure (Sinha et al., 1999). An existing refrigeration process for which a replacement refrigerant is to be found is defined by the following temperatures (Joback and Stephanopoulos, 1989):

1. evaporating temperature $T_e = 272$ K,

2. condensing temperature $T_{cd} = 316$ K,

3. mean process temperature $T_m = 294$ K.

Saturated conditions are assumed so that the evaporating temperature $T_e$ is equal to the saturation temperature ($T_s$). The relevant properties of freon-12 at the operating conditions are as follows:

1. enthalpy of vaporization at $T_e$: $H_{v,freon}^{T_e} = 18.4$ kJ mol$^{-1}$,

2. liquid specific heat at mean temperature $T_m$: $C_{pl,freon}^{T_m} = 27.1$ cal mol$^{-1}$ K$^{-1}$.

12.2 PROBLEM FORMULATION

The formulation of the CAMD problem proceeds in three stages (Joback, 1987). First, property constraints and the objective function are formulated. Then, group contribution relations between molecular structure and the properties of interests are added to the problem. Finally, the first order groups which are allowed to appear in the candidate refrigerants are selected, using physical information on the problem. The notation of Chapter IV is used.

12.2.1 Property constraints and objective function

The main property constraints arise from the requirement to obtain refrigerants that have a process performance at least as good as freon-12:

$$H_v^{T_e} \ge 18.4 \ \text{kJ mol}^{-1} \qquad (1)$$

$$C_{pl}^{T_m} \le 27.1 \ \text{cal mol}^{-1}\,\text{K}^{-1} \qquad (2)$$

The enthalpy of vaporization should be high to limit the volumetric flow. A low liquid specific heat reduces the amount of refrigerant that flashes upon passage through the expansion valve. The heat capacity constraint is evaluated at the mean process temperature. The vapor pressure of the molecule at the operating temperatures is considered to be an important property for refrigerant design.

The lowest pressure in the refrigeration cycle ($p_s^{T_e}$) should be greater than atmospheric. This reduces the possibility of air and moisture leaking into the system. Here, a minimum of 1.4 bar is imposed. On the other hand, a high system pressure increases the size, weight, and cost of equipment. A pressure ratio of 10 is considered to be the maximum for a refrigeration cycle and, as a result, the highest pressure in the system ($p_s^{T_{cd}}$) is limited to a maximum of 14 bar (Joback and Stephanopoulos, 1989):

$$p_s^{T_e} \ge 1.4 \ \text{bar} \qquad (3)$$

$$p_s^{T_{cd}} \le 14 \ \text{bar} \qquad (4)$$

Environmental, health and safety concerns are important issues for refrigerant design. The ozone depletion potential (ODP) and the tropospheric lifetime are related to the environmental impact of a compound. Non-flammability, good stability and low toxicity are also desirable. However, in this case study, only the most fundamental thermodynamic criteria (heat of vaporization, liquid specific heat and vapor pressure) are considered in the optimization formulation. The environmental characteristics as well as the stability of the molecules designed are considered after the optimization problem has generated some candidate molecules.

Once the main property constraints have been identified, the objective function to be minimized can be chosen. While many types of performance objective can be proposed to define a good refrigerant, the function used in this case study emphasizes the need for a high heat of vaporization, $H_v^{T_e}$, and a low heat capacity, $C_{pl}^{T_m}$ (Churi and Achenie, 1996):

$$F = \frac{C_{pl}^{T_m}}{H_v^{T_e}} \qquad (5)$$
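
The constraints (1)-(4) and the objective (5) can be evaluated directly once the four design properties of a candidate are known. The following minimal sketch does this for the rounded estimates reported in Table 3 of this chapter. The class and function names are illustrative assumptions, not part of the original formulation, and the mixed units (cal mol$^{-1}$ K$^{-1}$ divided by kJ mol$^{-1}$) follow the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    hv_te: float   # enthalpy of vaporization at Te, kJ/mol
    cpl_tm: float  # liquid heat capacity at Tm, cal/(mol K)
    ps_te: float   # vapor pressure at Te, bar
    ps_tcd: float  # vapor pressure at Tcd, bar


def objective(c: Candidate) -> float:
    """Eq. (5): liquid heat capacity divided by heat of vaporization."""
    return c.cpl_tm / c.hv_te


def is_feasible(c: Candidate) -> bool:
    """Property constraints (1)-(4)."""
    return (c.hv_te >= 18.4 and c.cpl_tm <= 27.1
            and c.ps_te >= 1.4 and c.ps_tcd <= 14.0)


# Rounded property estimates as listed in Table 3.
CANDIDATES = [
    Candidate("CH2F-CH2F", 30.1, 22.5, 2.4, 6.9),
    Candidate("CH3Br", 23.5, 17.8, 1.4, 6.7),
    Candidate("CH2F-CF3", 30.5, 25.0, 5.1, 13.3),
]

# Keep feasible candidates and sort by increasing objective value.
ranking = sorted((c for c in CANDIDATES if is_feasible(c)), key=objective)
for c in ranking:
    print(f"{c.name}: F = {objective(c):.3f}")
```

Because the inputs are rounded, the computed objective values can differ slightly in the third decimal from those in Table 3, but the ranking is unchanged.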

12.2.2 Structure-property relationships

In order to relate the property constraints and objective function to molecular structure, the group contribution method proposed by Marrero and Gani (2001) is used to estimate the following four properties:

• normal boiling point, Tb in K,
• critical temperature, Tc in K,
• critical pressure, Pc in bar,
• standard enthalpy of vaporization at 298 K, $H_v^{298}$ in kJ mol$^{-1}$.

The property estimation functions are used with only first order contributions taken into account:

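$$\exp\left(\frac{T_b}{222.543}\right) = \sum_{k \in G_1} n_{1k} C_k^{T_b} \qquad (6)$$

$$\exp\left(\frac{T_c}{231.239}\right) = \sum_{k \in G_1} n_{1k} C_k^{T_c} \qquad (7)$$

$$(P_c - 5.9827)^{-0.5} - 0.108998 = \sum_{k \in G_1} n_{1k} C_k^{P_c} \qquad (8)$$

$$H_v^{298} - 11.733 = \sum_{k \in G_1} n_{1k} C_k^{H_v} \qquad (9)$$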

where $C_k^{X}$ denotes the contribution of the first-order group k to property X. The enthalpy of vaporization at Te is estimated using the Watson relation and the estimates of critical temperature and standard enthalpy of vaporization (Reid et al., 1987):

$$H_v^{T_e} = H_v^{298} \left( \frac{1 - T_r^{T_e}}{1 - T_r^{298}} \right)^{0.38} \qquad (10)$$

where $T_r^{T_x}$ denotes the reduced temperature at temperature Tx.
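
As a worked illustration of the first-order estimates and the Watson relation, the sketch below evaluates them for methyl bromide (one CH3 group and one Br group), using the contributions listed for these groups in Table 2 of this chapter. The function names and dictionary layout are assumptions made for the example; the constants 222.543, 231.239, 5.9827, 0.108998 and 11.733 are those appearing in the group contribution equations above.

```python
import math

# First-order contributions copied from Table 2: (Tb, Tc, Pc, Hv298).
CONTRIB = {
    "CH3": (0.8491, 1.7506, 0.018615, 0.217),
    "Br": (2.4231, 4.5036, -0.001460, 9.888),
}


def estimate(groups):
    """groups maps a group name to its count n_1k.

    Returns (Tb [K], Tc [K], Pc [bar], Hv298 [kJ/mol]).
    """
    sums = [sum(n * CONTRIB[g][p] for g, n in groups.items()) for p in range(4)]
    tb = 222.543 * math.log(sums[0])            # inverted boiling point relation
    tc = 231.239 * math.log(sums[1])            # inverted critical temperature relation
    pc = 5.9827 + (sums[2] + 0.108998) ** -2    # inverted critical pressure relation
    hv298 = 11.733 + sums[3]                    # enthalpy of vaporization at 298 K
    return tb, tc, pc, hv298


def watson(hv298, tc, t, t_ref=298.0):
    """Watson relation: rescale Hv from t_ref to temperature t."""
    return hv298 * ((1.0 - t / tc) / (1.0 - t_ref / tc)) ** 0.38


tb, tc, pc, hv298 = estimate({"CH3": 1, "Br": 1})
hv_te = watson(hv298, tc, 272.0)
print(f"Tb={tb:.1f} K, Tc={tc:.1f} K, Pc={pc:.1f} bar, "
      f"Hv298={hv298:.1f} kJ/mol, HvTe={hv_te:.1f} kJ/mol")
```

The values obtained this way agree, to within rounding, with the estimates reported for CH3Br in Tables 3 and 4.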

The liquid specific heat $C_{pl}^{T_m}$ is estimated using the group contribution method of Chueh and Swanson (1973). This method provides an estimate of the heat capacity at 293 K. Since the mean process temperature is 294 K, the heat capacity at 293 K can reasonably be used. The set of groups in the Chueh and Swanson method is different from the set of first order groups listed in Table 1 of Chapter IV. Chueh and Swanson groups will be denoted by bold italics. In order to be able to describe the same set of molecules as with the Marrero and Gani groups, some of the Chueh and Swanson groups must be split, while others must be combined. For instance, the CHOH group of Chueh and Swanson is split into CH and OH in the Marrero and Gani set. Conversely, the CF group of Marrero and Gani does not exist in the Chueh and Swanson set, and can be obtained by combining C and F. Because of these differences between the two sets of groups, some corrections must be introduced in the group contribution formula. Thus, if a CH group is bonded to an OH group, the contribution these two groups make to the heat capacity should be equal to the contribution of CHOH (18.2 cal mol$^{-1}$ K$^{-1}$) and not to that of CH+OH (15.7 cal mol$^{-1}$ K$^{-1}$). Furthermore, the Chueh and Swanson approach requires the application of certain rules which modify the group contribution formula, such as a Cl contribution which depends on the number of Cl atoms bonded to a single carbon. The heat capacity (in cal mol$^{-1}$ K$^{-1}$) of any molecule built from the groups in Table 1, Chapter IV, is therefore given by

$$C_{pl}^{T_m} = \sum_{k \in G_1} n_{1k} C_{pl,k} + \sum_{o \in O} C_{pl,o}^{O} \qquad (11)$$

where $C_{pl,k}$ is the contribution to the heat capacity for each group $k \in G_1$, O is the set of additional rules which must be followed to calculate the heat capacity, and $C_{pl,o}^{O}$ is the contribution from the application of the o-th rule in set O. The relevant rules for this example are derived after the set of groups to be used in the design has been chosen.

The vapor pressures at the evaporating temperature, $p_s^{T_e}$, and at the condensing temperature, $p_s^{T_{cd}}$, are estimated using the Pitzer expansion with the Ambrose and Walton coefficients (Poling et al., 2000)

$$\ln\left(\frac{p_s^{T_x}}{P_c}\right) = f_0(T_r^{T_x}) + \omega f_1(T_r^{T_x}) + \omega^2 f_2(T_r^{T_x}) \qquad (12)$$

which requires the calculation of reduced temperatures $T_r^{T_x}$ at both temperatures, the ratio of boiling point to critical temperature θ, and the acentric factor value ω, where

$$\omega = \frac{-\ln P_c + \left(5.97616(1-\theta) - 1.29874(1-\theta)^{1.5} + 0.60394(1-\theta)^{2.5} + 1.06841(1-\theta)^{5}\right)/\theta}{\left(-5.03365(1-\theta) + 1.11505(1-\theta)^{1.5} - 5.41217(1-\theta)^{2.5} - 7.46628(1-\theta)^{5}\right)/\theta} \qquad (13)$$

$$\theta = \frac{T_b}{T_c} \qquad (14)$$

$$\tau_x = 1 - T_r^{T_x} \qquad (15)$$

$$f_0(T_r^{T_x}) = \left(-5.97616\,\tau_x + 1.29874\,\tau_x^{1.5} - 0.60394\,\tau_x^{2.5} - 1.06841\,\tau_x^{5}\right)/T_r^{T_x} \qquad (16)$$

$$f_1(T_r^{T_x}) = \left(-5.03365\,\tau_x + 1.11505\,\tau_x^{1.5} - 5.41217\,\tau_x^{2.5} - 7.46628\,\tau_x^{5}\right)/T_r^{T_x} \qquad (17)$$

$$f_2(T_r^{T_x}) = \left(-0.64771\,\tau_x + 2.41539\,\tau_x^{1.5} - 4.26979\,\tau_x^{2.5} + 3.25259\,\tau_x^{5}\right)/T_r^{T_x} \qquad (18)$$
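
The vapor pressure correlation can be sketched in a few lines of code. The example below uses the estimated Tb, Tc and Pc of CH3Br from Table 4 as inputs. One point is an assumption: the acentric factor is computed with Pc divided by 1.01325 bar (the normal boiling pressure), which is the usual way the Ambrose and Walton form is written when Pc is in bar; with this choice the sketch gives ω close to the 0.280 listed in Table 4. The function names are illustrative only.

```python
import math


def _aw(tr, a, b, c, d):
    """Shared Ambrose-Walton form: (a*tau + b*tau^1.5 + c*tau^2.5 + d*tau^5) / Tr."""
    tau = 1.0 - tr
    return (a * tau + b * tau**1.5 + c * tau**2.5 + d * tau**5) / tr


def f0(tr):
    return _aw(tr, -5.97616, 1.29874, -0.60394, -1.06841)


def f1(tr):
    return _aw(tr, -5.03365, 1.11505, -5.41217, -7.46628)


def f2(tr):
    return _aw(tr, -0.64771, 2.41539, -4.26979, 3.25259)


def acentric_factor(tb, tc, pc_bar, p_ref_bar=1.01325):
    """Acentric factor from the boiling-point condition (p_ref_bar is an assumption)."""
    theta = tb / tc
    return -(math.log(pc_bar / p_ref_bar) + f0(theta)) / f1(theta)


def vapor_pressure(t, tc, pc_bar, omega):
    """Saturated vapor pressure in bar at temperature t (Pitzer expansion)."""
    tr = t / tc
    return pc_bar * math.exp(f0(tr) + omega * f1(tr) + omega**2 * f2(tr))


# Estimated properties of CH3Br from Table 4.
tb, tc, pc = 263.8, 423.9, 68.8
omega = acentric_factor(tb, tc, pc)
ps_te = vapor_pressure(272.0, tc, pc, omega)   # evaporating temperature
ps_tcd = vapor_pressure(316.0, tc, pc, omega)  # condensing temperature
print(f"omega={omega:.3f}, ps(Te)={ps_te:.2f} bar, ps(Tcd)={ps_tcd:.2f} bar")
```

The two pressures can then be checked against the bounds of 1.4 bar and 14 bar imposed by constraints (3) and (4).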

12.2.3 Choice of first order groups

The choice of first order groups is based on knowledge of the problem and availability of contribution parameters for the group contribution methods employed. Thus, the choice of first order groups is partly based on the functional groups present in currently available refrigerants, typically Cl- or F-containing groups. However, groups such as CCl2, CHF and CHF2 lack contribution parameters for some of the relevant properties and are therefore not included. The set of seventeen first order groups shown in Table 1 is selected to design acyclic refrigerants. The contributions for these groups in the method of Marrero and Gani (2001) are shown in Table 2.

Table 1: Set of first order groups used for the refrigerant design problem (with group number).

CH3 (1)      CH2 (2)      CH (3)       C (4)        OH (29)      COOH (31)
CH2Cl (108)  CHCl (109)   CCl (110)    CHCl2 (111)  CCl3 (113)   CH2F (114)
CF (116)     CF2 (118)    CF3 (119)    Br (128)     Cl (130)

Table 2: Parameters for the groups and properties used. Superscripts 4, 5 and 6 refer to the footnotes.

Group   v_k   C_k^Tb (4)   C_k^Tc (4)   C_k^Pc (4)   C_k^Hv (4)   C_pl,k (5)
CH3     1     0.8491       1.7506       0.018615     0.217        8.80
CH2     2     0.7141       1.3327       0.013547     4.910        7.26
CH      3     0.2925       0.5960       0.007259     7.962        5.00
C       4     -0.0671      0.0306       0.001219     10.730       1.76
OH      1     2.5670       5.2188       -0.005401    24.214       10.7
COOH    1     5.1108       14.6038      0.009885     17.002       19.1
CH2Cl   1     2.6364       6.2561       0.021419     11.754       0
CHCl    2     2.0246       4.3756       0.015640     12.048       0
CCl     3     1.7049       3.7063       0.009187     16.597       0
CHCl2   1     3.3420       7.8956       0.028236     17.251       0
CCl3    1     3.9093       8.8073       0.036746     20.550       0
CH2F    1     1.5022       3.3179       0.023315     8.238        11.26
CF      3     1.0084       2.1633       -0.010120    6.739        5.76
CF2     2     0.5142       0.8543       0.018572     1.621        9.76
CF3     1     1.1916       1.7737       0.048565     7.352        13.76
Br      1     2.4231       4.5036       -0.001460    9.888        9.0
Cl      1     1.5147       4.0947       0.007923     2.107 (6)    0

The search space is restricted to single molecules constructed from this set of first order groups. Furthermore, the size of the molecule is limited to 5 first order groups ($N_{\max} = 5$). This limit is suitable for the refrigerant problem since molecules with a higher molecular weight do not have vapor pressures in the suitable range for refrigerants (Churi and Achenie, 1996).

4 From Table 6 of Marrero and Gani (2001). $H_v$ contribution in kJ/mol.

5 From the Chueh and Swanson method (1973), adapted to this set of first order groups, in cal mol$^{-1}$ K$^{-1}$.

6 No value for this parameter was provided in Marrero and Gani (2001). This value was regressed using a set of compounds containing the Cl group.

12.2.4 Forbidden bond and other specific constraints

The forbidden bonds for this set can be identified following the systematic strategy described in Section 4.3. All the first order groups belong to the set of "standard groups". Thus, rule 3 is the only rule which may be violated by some of the possible bonds. Group Cl should not be used to form CH2Cl (CH2 and Cl), CHCl (CH and Cl), CCl (C and Cl), CHCl2 (CHCl and Cl), and CCl3 (CCl and 2 Cl groups). The following constraints are therefore imposed:

$$\sum_{i,j} y_{(i,\mathrm{CH_2},a),(j,\mathrm{Cl},a)} = 0 \qquad (19)$$

$$\sum_{i,j} y_{(i,\mathrm{CH},a),(j,\mathrm{Cl},a)} = 0 \qquad (20)$$

$$\sum_{i,j} y_{(i,\mathrm{C},a),(j,\mathrm{Cl},a)} = 0 \qquad (21)$$

$$\sum_{i,j} y_{(i,\mathrm{CHCl},a),(j,\mathrm{Cl},a)} = 0 \qquad (22)$$

$$\sum_{j_1 \in V} \sum_{j_2 \in V} \left( y_{(i,\mathrm{CCl},a),(j_1,\mathrm{Cl},a)} + y_{(i,\mathrm{CCl},a),(j_2,\mathrm{Cl},a)} \right) \le 1, \quad \forall i \in V. \qquad (23)$$

Furthermore, it is desirable to prevent the formation of CFCs by the simultaneous presence of chlorinated groups and fluorinated groups. We define $G_{Cl}$ = {CH2Cl, CHCl, CCl, CHCl2, CCl3, Cl}, the set of chlorinated groups, and $G_F$ = {CH2F, CF, CF2, CF3}, the set of fluorinated groups. We define a binary variable $\zeta_{Cl}$ such that

$$\zeta_{Cl} = \begin{cases} 1, & \text{if there is at least one group in } G_{Cl} \text{ in the compound} \\ 0, & \text{otherwise} \end{cases}$$

Then,

$$u_{i,k} \le \zeta_{Cl}, \quad \forall i \in V, \ \forall k \in G_{Cl}. \qquad (24)$$

$$\zeta_{Cl} \le \sum_{k \in G_{Cl}} \sum_{i=1}^{5} u_{i,k} \qquad (25)$$

$$\zeta_{Cl} + u_{i,k} \le 1, \quad \forall k \in G_F, \ \forall i \in V. \qquad (26)$$
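
The effect of constraints (24)-(26) can be checked on a plain list of groups, one per vertex. The sketch below is only an illustration of the logic, not of the MINLP model itself: the indicator is computed directly from the group list (which is the value that constraints (24) and (25) force on it), and constraint (26) is then tested for every vertex. Function names are assumptions.

```python
G_CL = {"CH2Cl", "CHCl", "CCl", "CHCl2", "CCl3", "Cl"}
G_F = {"CH2F", "CF", "CF2", "CF3"}


def zeta_cl(groups):
    """Value forced by (24)-(25): 1 if any chlorinated group is present, else 0."""
    return 1 if any(g in G_CL for g in groups) else 0


def satisfies_cfc_exclusion(groups):
    """Constraint (26): zeta_Cl + u_ik <= 1 for every fluorinated group at every vertex."""
    z = zeta_cl(groups)
    return all(z + (1 if g in G_F else 0) <= 1 for g in groups)


print(satisfies_cfc_exclusion(["CH2F", "CF3"]))     # fluorinated only
print(satisfies_cfc_exclusion(["CH2Cl", "CH3"]))    # chlorinated only
print(satisfies_cfc_exclusion(["CF2", "Cl", "Cl"]))  # CCl2F2 built from CF2 and two Cl
```

With these definitions, a molecule assembled from CF2 and two Cl groups, which corresponds to freon-12 itself, is rejected, while purely fluorinated or purely chlorinated candidates pass.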

To calculate the heat capacity, Table 1 is used to identify the relevant groups in the Chueh and Swanson set. They are CH3, CH2, CH, C, COOH, CH2OH, CHOH, COH, OH, Cl, Br, F. The contributions $C_{pl,k}$ for every group $k \in G_1$ are derived from the contributions of the Chueh and Swanson groups and are listed in Table 2.

The following rules must be applied:

1. For every Cl group linked to a given carbon group (CH3, CH2, CH, C, CH2=CH, COOH, CH2Cl, CHCl, CCl, CHCl2, CCl3, CH2F, CF, CF2, CF3), add 8.6 cal mol$^{-1}$ K$^{-1}$.

2. For every third or fourth Cl group linked to a single carbon group (CH, C, CHCl, CCl, CHCl2, CCl3, CF), add -2.6 cal mol$^{-1}$ K$^{-1}$.

3. If a given CH2, CH2Cl or CH2F is bonded to at least one OH, add -0.46 cal mol$^{-1}$ K$^{-1}$.

4. If a given CH, CHCl or CHCl2 is bonded to at least one OH, add 2.5 cal mol$^{-1}$ K$^{-1}$.

5. If a given C, CCl, CCl3, CF, CF2 or CF3 group is bonded to at least one OH group, add 14.14 cal mol$^{-1}$ K$^{-1}$.

Rules 1 and 2 are equivalent to the rules of the original method of Chueh and Swanson but apply to the Marrero and Gani groups. Rules 3 to 5 have been added to correct for the differences between the two sets of groups. The mathematical equivalents of these rules are given by the following constraints.

Rule 1 - For every Cl group linked to a given carbon group (CH3, CH2, CH, C, CH2=CH, COOH, CH2Cl, CHCl, CCl, CHCl2, CCl3, CH2F, CF, CF2, CF3), add 8.6 cal mol$^{-1}$ K$^{-1}$.

We introduce a new variable $\mu_{Cl,i}$ which denotes the number of Cl groups linked to a carbon atom at vertex i, $i \in V$:

$$\begin{aligned}
\mu_{Cl,i} = {} & \sum_{j} \Big( y_{(i,\mathrm{CH_3},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CH_2},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CH},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{C},a),(j,\mathrm{Cl},a)} \\
& + y_{(i,\mathrm{COOH},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CH_2Cl},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CHCl},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CCl},a),(j,\mathrm{Cl},a)} \\
& + y_{(i,\mathrm{CHCl_2},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CCl_3},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CH_2F},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CF},a),(j,\mathrm{Cl},a)} \\
& + y_{(i,\mathrm{CF_2},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CF_3},a),(j,\mathrm{Cl},a)} \Big) \\
& + u_{i,\mathrm{CH_2Cl}} + u_{i,\mathrm{CHCl}} + u_{i,\mathrm{CCl}} + 2u_{i,\mathrm{CHCl_2}} + 3u_{i,\mathrm{CCl_3}}, \quad \forall i \in V, \qquad (27)
\end{aligned}$$

where the $u_{i,k}$ variables, which denote the existence of group k at vertex i, are used to count the Cl atoms which appear in carbon-containing first order groups. The contribution from rule 1 is

$$C_{pl,1}^{O} = 8.6 \sum_{i \in V} \mu_{Cl,i} \qquad (28)$$

Rule 2 - For every third or fourth Cl group linked to a single carbon group (CH, C, CHCl, CCl, CHCl2, CCl3, CF), add -2.6 cal mol$^{-1}$ K$^{-1}$.

We introduce two new binary variables to identify the presence of a third and fourth Cl atom linked to a given vertex i.

$$p_{3i} = \begin{cases} 1, & \text{if } \mu_{Cl,i} \ge 3, \\ 0, & \text{otherwise.} \end{cases}$$

$$p_{4i} = \begin{cases} 1, & \text{if } \mu_{Cl,i} = 4, \\ 0, & \text{otherwise.} \end{cases}$$

The values of $p_{3i}$ and $p_{4i}$ are set through the following constraints

$$\mu_{Cl,i} - 2.5 \le 2.5\,p_{3i} \le \mu_{Cl,i}, \quad \forall i \in V. \qquad (29)$$

$$\mu_{Cl,i} - 3.5 \le 3.5\,p_{4i} \le \mu_{Cl,i}, \quad \forall i \in V. \qquad (30)$$

Then, the contribution for rule 2 is given by

$$C_{pl,2}^{O} = -2.6 \sum_{i \in V} (p_{3i} + p_{4i}) \qquad (31)$$
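
A quick way to see that constraints (29) and (30) pin the binary variables to the intended values is to enumerate them for every possible chlorine count. The sketch below does this by brute force and then evaluates the rule 1 and rule 2 contributions for a list of per-vertex chlorine counts. The function names and the example input are assumptions made for illustration.

```python
def third_fourth_flags(mu_cl):
    """Return the only binary pair (p3, p4) that satisfies constraints (29)-(30)."""
    feasible = [
        (p3, p4)
        for p3 in (0, 1)
        for p4 in (0, 1)
        if mu_cl - 2.5 <= 2.5 * p3 <= mu_cl and mu_cl - 3.5 <= 3.5 * p4 <= mu_cl
    ]
    assert len(feasible) == 1, "constraints should fix p3 and p4 uniquely"
    return feasible[0]


def chlorine_corrections(mu_by_vertex):
    """Rule 1 and rule 2 contributions (cal/(mol K)) from per-vertex Cl counts."""
    rule1 = 8.6 * sum(mu_by_vertex)
    rule2 = -2.6 * sum(sum(third_fourth_flags(m)) for m in mu_by_vertex)
    return rule1, rule2


for mu in range(5):
    print(mu, third_fourth_flags(mu))

# Hypothetical two-vertex molecule: a CCl3 group bonded to a Cl group,
# so the carbon vertex carries four Cl atoms and the Cl vertex none.
print(chlorine_corrections([4, 0]))
```

For chlorine counts of 0 to 2 both flags are zero, for 3 only the first flag is set, and for 4 both are set, so a carbon carrying four Cl atoms receives the -2.6 correction twice.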

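Rule 3 - If a given CH2, CH2Cl or CH2F group is bonded to at least one OH group, add -0.46 cal mol$^{-1}$ K$^{-1}$.

We introduce the binary variable $\zeta_{OH,i,k}$ such that

$$\zeta_{OH,i,k} = \begin{cases} 1, & \text{if there is an OH group linked to group } k \text{ at vertex } i \\ 0, & \text{otherwise} \end{cases}$$

for all $i \in V$ and for all $k \in$ {CH, CHCl, CH2, C, CCl, CF, CF2}. Then,

$$\zeta_{OH,i,k} \le \sum_{j} y_{(i,k,a),(j,\mathrm{OH},a)}, \quad \forall k \in \{\mathrm{CH}, \mathrm{CHCl}, \mathrm{CH_2}, \mathrm{C}, \mathrm{CCl}, \mathrm{CF}, \mathrm{CF_2}\}, \ \forall i \in V. \qquad (33)$$

The contribution due to rule 3 is then

$$C_{pl,3}^{O} = -0.46 \sum_{i} \Big( \zeta_{OH,i,\mathrm{CH_2}} + \sum_{j} \big( y_{(i,\mathrm{CH_2Cl},a),(j,\mathrm{OH},a)} + y_{(i,\mathrm{CH_2F},a),(j,\mathrm{OH},a)} \big) \Big) \qquad (34)$$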

Note that since CH2Cl and CH2F can be bonded with at most one OH group, there is no need to define a $\zeta_{OH,i,k}$ variable for these groups.

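Rule 4 - If a given CH, CHCl or CHCl2 is bonded to at least one OH, add 2.5 cal mol$^{-1}$ K$^{-1}$.

$$C_{pl,4}^{O} = 2.5 \sum_{i} \Big( \zeta_{OH,i,\mathrm{CH}} + \zeta_{OH,i,\mathrm{CHCl}} + \sum_{j} y_{(i,\mathrm{CHCl_2},a),(j,\mathrm{OH},a)} \Big) \qquad (35)$$

Rule 5 - If a given C, CCl, CCl3, CF, CF2, CF3 is bonded to at least one OH, add 14.14 cal mol$^{-1}$ K$^{-1}$.

$$C_{pl,5}^{O} = 14.14 \sum_{i} \Big( \zeta_{OH,i,\mathrm{C}} + \zeta_{OH,i,\mathrm{CCl}} + \zeta_{OH,i,\mathrm{CF}} + \zeta_{OH,i,\mathrm{CF_2}} + \sum_{j} \big( y_{(i,\mathrm{CCl_3},a),(j,\mathrm{OH},a)} + y_{(i,\mathrm{CF_3},a),(j,\mathrm{OH},a)} \big) \Big) \qquad (36)$$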

12.2.5 Summary of formulation

The formulation involves the following sets and indices

G = {CH3, CH2, CH, C, OH, COOH, CH2Cl, CHCl, CCl, CHCl2, CCl3, CH2F, CF, CF2, CF3, Br, Cl} - Indices: k, kk ∈ G

$G_{Cl}$ = {CH2Cl, CHCl, CCl, CHCl2, CCl3, Cl}

$G_F$ = {CH2F, CF, CF2, CF3}

V = {1,2,3,4,5} - Indices: i, j, j1, j2 ∈ V

O = {1,2,3,4,5} - Indices: o ∈ O

The following variables are defined

• $\zeta_{Cl} \in \{0,1\}$: whether there is a Cl atom in the molecule
• $u_{i,k} \in \{0,1\}$, k ∈ G, i ∈ {1,...,5}: which group is found at which vertex
• $y_{(i,k,a),(j,kk,a)} \in \{0,1\}$, k ∈ G, kk ∈ G, i ∈ {1,...,5}, j ∈ {1,...,5}: vertex adjacency matrix
• $\zeta_{OH,i,k} \in \{0,1\}$, k ∈ {CH, CHCl, CH2, C, CCl, CF, CF2}, i ∈ {1,...,5}: whether an OH group is linked to (i,k)
• $p_{3i}, p_{4i} \in \{0,1\}$, i ∈ {1,...,5}: whether 3 or 4 Cl atoms are bonded to the carbon at a given vertex
• $H_v^{T_e}, H_v^{298}, C_{pl}^{T_m}, p_s^{T_e}, p_s^{T_{cd}}, P_c, T_r^{T_e}, T_r^{T_{cd}}, T_r^{298}, T_c, T_b, \theta, \omega \in \mathbb{R}$: properties of the molecule
• $C_{pl,o}^{O} \in \mathbb{R}$, o ∈ {1,...,7}: heat capacity contributions
• $\mu_{Cl,i} \in \mathbb{R}$, i ∈ {1,...,5}: number of Cl atoms linked to a given vertex
• $n_{1k} \in \mathbb{N}$, k ∈ G: number of groups of a given type in the molecule

12.3 PROBLEM SOLUTION

The GAMS interface to DICOPT (DIscrete and Continuous OPTimizer) is used to solve the resulting mixed-integer nonlinear programming (MINLP) problem (GAMS). Table 3 shows the molecules obtained for the problem described above. The compounds, along with their design property estimates, are listed in decreasing order of optimality. Integer cuts were used to identify different candidates (Floudas, 1995). The other property estimates for the molecules obtained are listed in Table 4 and the corresponding experimental values are shown in Table 5. Table 6 shows that, on the basis of available data, the largest discrepancies occur in the prediction of saturated vapor pressure, which is systematically overestimated relative to the values calculated from experimental data. This is largely a result of inaccuracies in the critical properties, and hence in the acentric factor (negative in two cases) and the reduced temperature values. This level of error is not representative of the overall accuracy of the group contribution methods used, but can be expected due to the small size of the molecules considered. This problem may be circumvented by relaxing the pressure constraints to account for uncertainty in the predictions, as suggested for example by Duvedi and Achenie (1996).

Table 3: Best candidates for the acyclic refrigerant problem, with their design property estimates.

Molecule            Objective function value   H_v^Te (kJ mol^-1)   C_pl^Tm (cal mol^-1 K^-1)   p_s^Te (bar)   p_s^Tcd (bar)
CH2F-CH2F (R152)    0.748                      30.1                 22.5                        2.4            6.9
CH3Br (R40B1)       0.760                      23.5                 17.8                        1.4            6.7
CH2F-CF3 (R134a)    0.821                      30.5                 25.0                        5.1            13.3

Table 4: Property estimates for the candidate refrigerants.

Molecule     Tb (K)   Tc (K)   Pc (bar)   H_v^298 (kJ mol^-1)   ω        θ
CH2F-CH2F    244.8    437.6    47.3       28.2                  -0.073   0.559
CH3Br        263.8    423.9    68.8       21.8                  0.280    0.622
CH2F-CF3     220.5    376.4    36.5       27.3                  -0.043   0.586

Table 5: Experimental values of the candidate properties. All data from SMSWIN unless otherwise indicated; § from Afeefy et al. (2001), † from Langley (1995).

Molecule     Tb (K)   Tc (K)   Pc (bar)   H_v^298 (kJ mol^-1)   ω       θ
CH2F-CH2F    283.7    445.0    42.8       ***                   0.222   0.638
CH3Br        276.7    467.0    80.0       23†                   0.192   0.582
CH2F-CF3     246.7    374.3    40.1       18§                   0.327   0.659

Table 6: Best candidates for the acyclic refrigerant problem, with design property estimates calculated from the experimental values in Table 5 using Eqs. (10) and (12)-(18).

Molecule     H_v^Te (kJ mol^-1)   C_pl^Tm (cal mol^-1 K^-1)   p_s^Te (bar)   p_s^Tcd (bar)
CH2F-CH2F    ***                  ***                         0.6            2.9
CH3Br        24                   ***                         0.8            3.8
CH2F-CF3     20                   ***                         2.8            10.8

The best molecule found, 1,2-difluoroethane (R152), is known to be toxic. The second best molecule, methyl bromide, was officially listed as an ozone depleting substance in 1992. Finally, the third best molecule, 1,1,1,2-tetrafluoroethane (R134a), has been extensively used as a refrigerant replacement for freon-12, and it passes both ozone depletion and toxicity tests. It should however be noted that it is related to another environmental problem, namely global warming.


[1] H.Y. Afeefy, J.F. Liebman and S.E. Stein, Neutral Thermochemical Data in NIST Chemistry WebBook, NIST Standard Reference Database Number 69, Eds. P.J. Linstrom and W.G. Mallard, National Institute of Standards and Technology, Gaithersburg MD (http://webbook.nist.gov) (2001).

[2] C.F. Chueh and A.C. Swanson, Estimation of liquid heat capacity, Can. J. Chem. Eng., 51 (1973) 596.

[3] N. Churi and L.E.K. Achenie, Novel mathematical programming model for computer aided molecular design, Ind. Eng. Chem. Res., 35 (1996) 3788.

[4] M.A. Duran and I.E. Grossmann, An Outer-Approximation algorithm for a class of mixed-integer nonlinear programs, Math. Prog., 36 (1986) 307.

[5] A. Duvedi and L. Achenie, Designing environmentally safe refrigerants using mathematical programming, Chem. Eng. Sci., 15 (1996) 3727.

[6] C.A. Floudas, Nonlinear and mixed-integer optimization: Fundamentals and applications, Oxford University Press, Oxford (1995).

[7] GAMS, Generalized Algebraic Modeling System, www.gams.com.

[8] R. Gani, B. Nielsen and A. Fredenslund, A group contribution approach to computer-aided molecular design, AIChE J., 37 (1991) 1318.

[9] K.G. Joback, Designing molecules possessing desired physical property values, Ph.D. thesis, MIT, Cambridge (1987).

[10] K.G. Joback and G. Stephanopoulos, Designing molecules possessing desired physical property values, Foundations of Computer Aided Process Design, (1989) 363.

[11] B.C. Langley, Fundamentals of refrigeration, Delmar Publishers, Albany, N.Y. (1995).

[12] J. Marrero and R. Gani, Group-contribution based estimation of pure component properties, Fluid Phase Eq., 183-184 (2001) 183.

[13] B.E. Poling, J.M. Prausnitz and J.P. O'Connell, The properties of gases and liquids, McGraw-Hill, New York, 5th edition (2000).

[14] R.C. Reid, J.M. Prausnitz and B.E. Poling, The properties of gases and liquids, McGraw-Hill, New York, 4th edition (1987).

[15] N.V. Sahinidis and M. Tawarmalani, Applications of global optimization to process and molecular design, Comp. Chem. Eng., 24 (2000) 2157.

[16] M. Sinha, L.E.K. Achenie and G.M. Ostrovsky, Environmentally benign solvent design by global optimization, Comp. Chem. Eng., 23 (1999) 1381.

[17] SMSWIN, Computer-Aided Process Engineering Center, CAPEC, Technical University of Denmark, http://www.capec.kt.dtu.dk/main/default.htm

Computer Aided Molecular Design: Theory and Practice, L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors), © 2003 Elsevier Science B.V. All rights reserved.

Chapter 13: Polymer Design Case Study

P. R. Patkar & V. Venkatasubramanian

12.4 INTRODUCTION

A background of genetic algorithms was presented in Chapter 5 of the book and the adaptation of GAs for the problem of computer-aided molecular design was discussed. The framework for evolutionary molecular design using GAs proposed by Venkatasubramanian and co-workers [1] was presented in detail. The utility of the framework was illustrated by means of two example problems in polymer design. The case study problems demonstrated the success of the genetic design approach in locating optimal designs for the desired target constraints. The advantage of GAs in their ability to discover a diverse population of near-optimal designs was also highlighted.

The sample problems discussed in Chapter 5 were relatively simple from the points of view of the enforced target constraints as well as combinatorial complexity. In the discussion that follows, a bigger problem of polymer design is presented, from work done by Venkatasubramanian et al. [2]. The objective of considering the bigger problem is two-fold: primarily, the investigation of the efficacy of the genetic design system for problems with much larger and more complex design spaces, and second, to describe the extension of the original GA framework by incorporating higher-level chemical knowledge to enable better handling of constraints such as chemical stability and molecular complexity. In the sections that follow, the large-scale polymer design problem is first introduced. The details of the GA implementation are omitted since they are almost the same as those discussed for the case studies in Chapter 5. Results for the standard as well as for the knowledge augmented genetic design framework are presented. Then some aspects concerning parametric sensitivity and robustness of GAs are discussed. Finally, conclusions are offered based on the results of the study.

13.1 POLYMER DESIGN CASE STUDY

Chapter 5 presented two polymer design case studies from investigations by Venkatasubramanian et al. [1]. The genetic algorithm was required to design polymer repeat structures given certain macroscopic property values. The property values of known polymers, namely (i) polyethylene terephthalate (PET), (ii) polyvinylidene propylene copolymer (PVP), and (iii) polycarbonate of bisphenol-A (PC), were used. The palette of groups for the search was relatively small and involved only four mainchain groups (>C<, -C6H4-, -C=OO-, -O-) and four sidechain groups (-H, -CH3, -F, -Cl). To recapitulate the results, the genetic search discovered all three target polymers in a fraction of the 200 total generations allowed, for all design lengths (maximum repeat structure length, L=2-7 and L=2-10) and for all initial population conditions (random mainchain and sidechain groups, and -CH2- only). For instance, the average generation number at which the design target was first located and the success rate (in parentheses) for an initial population initialized with random mainchain and sidechain groups having lengths 2-7 were: (i) PET 11.3 generations (100%), (ii) PVP 28.2 generations (100%), and (iii) PC 41.0 generations (100%). The GA was also able to determine many high-fitness alternate structures.

Fig. 1. Extended palette of base groups for the design case study (panels: Mainchain Groups, Sidechain Groups).

For the present case study, taken from Venkatasubramanian et al. [2], the design problem was made much larger and the search space more complex by increasing the base group choices to 17 mainchain and 15 sidechain groups. The extended palette of base groups is shown in Fig. 1. In the smaller problem, when the base groups consisted of four mainchain and four sidechain groups, the total number of design candidates was about $1.4 \times 10^{5}$. Under the increased number of mainchain and sidechain groups, the search space was magnified to $1.1 \times 10^{13}$ candidates considering design lengths of 2 to 7. Thus, the search space was about 100 million times larger than that in the earlier study. Also, the number of target polymers evaluated was increased from three in the previous study to nine, as shown in Table 1. The search space was further complicated by the increased number of nonlinear group interactions. For example, for polymer design target 4, the nonlinear van Krevelen group interactions required that every mainchain group, other than the -O- endgroup, and every sidechain group be in their proper position in order to give the optimal fitness of 1. That is, the macroscopic properties depended not only on the group types but also on their exact ordering in the target molecule.

Table 1. Target polymers and their properties [repeat-unit structure drawings not reproduced]

Target polymer   ρ (g/cm^3)   Tg (K)   α (K^-1, ×10^-4)   Cp (J/kg·K)   K (N/m^2, ×10^9)
TP1              1.34         350.8    2.96               1152.67       5.18
TP2              1.18         225.2    2.81               1377.82       2.51
TP3              1.21         420.8    2.90               1135.10       5.40
TP4              1.19         406.8    2.90               1073.96       5.39
TP5              1.28         472.0    2.89               995.95        5.31
TP6              1.25         421.1    2.90               1016.55       6.12
TP7              1.06         322.3    2.98               1455.90       3.85
TP8              1.27         322.1    2.81               1152.67       3.42
TP9              1.09         428.7    2.77               1163.10       4.12

ρ = density, Tg = glass transition temperature, α = thermal expansion coefficient, Cp = specific heat capacity, K = bulk modulus

The number of property constraints was the same as before at five and included the following properties: density, glass transition temperature, thermal expansion coefficient, specific heat capacity and bulk modulus. Predicted values of these physical properties for a given molecular structure were calculated by the van Krevelen [3] group contribution methods.

The second aspect of the case study involved the incorporation of higher-level chemical knowledge, which is discussed next.

13.1.1 Incorporation of high-level knowledge: Molecular Stability

Higher-level chemical knowledge was incorporated to steer the search towards more chemically realistic and stable polymers. For example, it is commonly known that certain group combinations such as -O-O-O- and -OC=O-C=O- lead to chemically unstable structures and are therefore undesirable in candidate solutions presented by the design system. When no such higher-level knowledge was included in the GA, these group combinations were often found in many high-fitness polymers in the smaller case study [1]. Another example of a practical constraint on a design system is environmental acceptability. Certain molecular groups or group combinations are known to be environmentally toxic or unacceptable. This is a common problem in the design of agrochemicals such as fertilizers and pesticides, as well as refrigerants. Yet another important consideration would be the relative ease or difficulty involved in the synthesis or manufacture of the proposed design candidates. It is important to be able to incorporate all such constraints in the design process. In the current study, only stability and molecular complexity constraints were addressed.

In the knowledge-augmented GA framework, chromosomes with unstable mainchain group combinations were assigned zero fitness. As a result of natural selection, such solutions were automatically weeded out of the design process and thereby removed from any further consideration. The knowledge incorporated into the algorithm about the stability of nearest neighbor mainchain groups was drawn from Barton and Ollis [4].

13.1.2 Molecular Complexity

Molecular complexity is encoded as a count of the total number of mainchain and sidechain groups and is given by the following equations [5, 6, 7]:

$$F'(x) = F(x) - \beta \times \mathrm{Sig} \times \mathrm{Complexity} \qquad (1)$$

$$\mathrm{Sig} = \frac{2}{1 + \exp[-\gamma\{F - F_{crit}\}]} \qquad (2)$$

$$\mathrm{Complexity} = \frac{MC + SC}{MC_{max} + SC_{max}} \qquad (3)$$

where F is the fitness value, F' is the penalized fitness, β is a penalty scaling factor, Sig is a sigmoidal fitness function, given by equation (2), that provides a fitness threshold, $F_{crit}$, for the genetic algorithm to start penalizing complex designs, and γ is a decay scaling parameter. The complexity measure, given by equation (3), ranges from 0 to 1 and is given by the ratio of the number of mainchain (MC) and sidechain (SC) units in the current design to the maximum allowable mainchain and sidechain units (32 in this case). Thus, the complexity of a polymer repeat structure is viewed in terms of its 'size' as given by the number of units in the repeat structure. The smaller the molecule, the lower is its complexity. In order to encourage the favoring of simple molecules over more complex ones of comparable fitness, a penalty was applied to the fitness. All molecules having fitness values greater than the threshold $F_{crit}$ were penalized as given by equation (1) in direct proportion to their complexity.
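
The complexity penalty can be sketched numerically as follows. The parameter values (β = 0.10, γ = 100, $F_{crit}$ = 0.99, and 32 maximum units) are those quoted in this chapter; the function names and the example repeat-unit sizes are assumptions introduced only for illustration.

```python
import math


def sigmoid_gate(f, f_crit=0.99, gamma=100.0):
    """Equation (2): switches the penalty on as the fitness approaches f_crit."""
    return 2.0 / (1.0 + math.exp(-gamma * (f - f_crit)))


def complexity(mc, sc, max_units=32):
    """Equation (3): fraction of the maximum allowable mainchain + sidechain units."""
    return (mc + sc) / max_units


def penalized_fitness(f, mc, sc, beta=0.10):
    """Equation (1): fitness reduced in proportion to molecular complexity."""
    return f - beta * sigmoid_gate(f) * complexity(mc, sc)


# A low-fitness design is left essentially untouched ...
print(penalized_fitness(0.50, mc=4, sc=4))
# ... while near-optimal designs are penalized according to their size.
print(penalized_fitness(0.99, mc=4, sc=4))
print(penalized_fitness(0.99, mc=7, sc=14))
```

At the threshold fitness the gate equals exactly 1, so two designs with the same fitness of 0.99 differ in penalized fitness only through their unit counts.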

13.2 GA BASED SEARCH

The evolutionary search approach based on GAs has already been discussed in detail in chapter 5. The same framework was adopted for the larger polymer design problem. Slight modifications had to be made to handle the constraints arising out of molecular stability and complexity or maximum molecular length. These constraints were handled via suitable modification of the fitness function. A penalty was assigned to the overall fitness for design candidates that violated the defined constraints. The penalized fitness function used for this purpose can be expressed as [8]:

$$F'(x) = F(x) + \varepsilon\,\eta \sum_{i=1}^{P} \varphi_i \qquad (4)$$

where F'(x) is the penalized fitness, P is the total number of constraints, η is a penalty coefficient, ε is -1 for maximization and +1 for minimization problems, and $\varphi_i$ is a penalty related to the i-th constraint. As mentioned before, the penalty was very severe for violation of stability constraints. Chromosomes infeasible with respect to stability were directly assigned zero fitness.
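
A minimal sketch of this penalty scheme is given below. The penalty values, the coefficient η and the `stable` flag are hypothetical inputs chosen for the example; how the individual penalties are computed is not specified here and is left to the caller.

```python
def constrained_fitness(fitness, penalties, eta, maximize=True, stable=True):
    """Equation (4), with zero fitness for designs that violate stability.

    penalties: iterable of non-negative penalty terms, one per constraint.
    eta:       penalty coefficient.
    maximize:  selects the sign factor (-1 for maximization, +1 for minimization).
    stable:    False if the design contains an unstable group combination.
    """
    if not stable:
        return 0.0
    eps = -1.0 if maximize else 1.0
    return fitness + eps * eta * sum(penalties)


print(constrained_fitness(0.9, [0.1, 0.2], eta=0.5))                 # penalized
print(constrained_fitness(0.9, [], eta=0.5))                         # no violations
print(constrained_fitness(0.9, [0.1, 0.2], eta=0.5, stable=False))   # rejected
```

Treating stability as a hard override rather than as one more penalty term mirrors the statement that unstable chromosomes were assigned zero fitness outright.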

The parameter values used for the search are given in Table 2. The design lengths varied from two base group units to a maximum of two units more than the polymer design target. The fitness function gain, α, was equal to 0.001. The parameters for equations (1), (2) and (3) were as follows: $F_{crit}$ = 0.99, which resulted in applying the complexity measure only after near optimal solutions were attained; γ = 100, which provided a gradual activation of the complexity measure as the fitness approached the critical value; and β = 0.10, so that a large penalty reduced the overall design fitness to a point where the genetic algorithm considered the design to be unworthy of further consideration.

For statistical significance, results were compiled after 25 runs of 1000 generations each. The genetic design investigations carried out were subdivided into the following scenarios: (i) standard genetic design (ii) knowledge-augmented genetic design, which penalized unstable mainchain group combinations, and (iii) knowledge-augmented genetic design, which penalized unstable mainchain group combinations and molecular complexity.

Table 2: GA Parameters

Parameter                                            Value
Steady state population                              100
Number of generations                                1000
Gaussian fitness decay rate (α)                      0.001
Complexity sigmoid gain (β)                          0.1
Complexity penalty (γ)                               100
Maximum polymer length                               Target length + 2
Elitist retention with respect to population size    10%

Genetic operator probabilities
Crossover                                            0.2
Backbone mutation                                    0.2
Sidechain mutation                                   0.2
Hop                                                  0.2
Deletion                                             0.1
Blending                                             0.1
Insertion                                            0.0
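The operator probabilities in Table 2 sum to one, so they can be read as the weights of a discrete distribution from which one operator is drawn per reproduction event. The sketch below illustrates that reading only; the dictionary keys and the function name are hypothetical, and the chapter does not state how the original system sampled its operators.

```python
import random

# Operator probabilities as listed in Table 2 (key names are illustrative).
OPERATOR_PROBABILITIES = {
    "crossover": 0.2,
    "backbone_mutation": 0.2,
    "sidechain_mutation": 0.2,
    "hop": 0.2,
    "deletion": 0.1,
    "blending": 0.1,
    "insertion": 0.0,
}


def choose_operator(rng):
    """Draw one genetic operator with probability proportional to its weight."""
    names = list(OPERATOR_PROBABILITIES)
    weights = [OPERATOR_PROBABILITIES[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With a weight of 0.0, the insertion operator is never selected, which matches its entry in the table.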

13.3 RESULTS AND DISCUSSION

The results for the different genetic design cases are presented in Table 3, arranged as follows. The rows labeled part (a) give the percent success rate in achieving the design objective, with the number of successful runs in parentheses, for each target. Part (b) presents the average generation at which the target was first located. The rows labeled part (c) show the average number of distinct high-fitness solutions found for each target.

As was expected, the genetic design was not as successful as it was in the case of the smaller case study, when it located the target molecule in every run (i.e. a success rate of 100%). However, the most important observation here was that the genetic design still succeeded in finding the target molecule for eight out of the nine target polymers, even though the search space had exploded by over a factor of 100 million. As seen from part (a) of the table, with the exception of target polymer 4, all target polymers were located at least once by one of the design scenarios (i.e., columns 3-7).

From part (b) of Table 3, it is seen that some molecules took longer than others to be discovered. For example, target polymer 7 was always found in less than 100 generations. On the other hand, target polymer 6 was located with varying success (4%-68%) and took more than 400 generations for discovery. Typically, longer molecules that required exact mainchain group ordering and sidechain positioning needed more generations to be discovered. This explained why target polymer 7, which was the only target molecule with no group ordering constraint was quickly located while target polymer 6, which required exact ordering, took much longer to discover. The exact ordering requirement and the long backbone structure were also the reasons why target polymer 4 was never discovered in any of the runs of 1000 generations each.

Columns five to seven of Table 3 present results for the knowledge-augmented genetic search, in which higher-level chemical knowledge about the feasibility and stability of group combinations and about molecular complexity was incorporated. Several general trends emerge from these results. When the initial population consisted of random mainchain and sidechain groups, the success rates were generally higher with the knowledge-augmented genetic design than with the standard genetic design (part (a) of column 3 vs. columns 5 and 7). Thus, the addition of higher-level chemical knowledge improved the design efficiency. For column 7, since the complexity measure was applied only after the fitness threshold was exceeded, more generations were required to achieve the target. This also helps explain why the genetic design in this scenario was unable to locate target polymers 3, 4, and 9. In summary, it appears that the incorporation of higher-level chemical knowledge not only produced candidates that were chemically feasible, stable, and less complex but also increased the efficiency of the search by eliminating spurious candidates in the genetic design.


Table 3: Results for the genetic search

Column   Search type                  Initial population
3        Standard GA                  random MC, SC
4        Standard GA                  random MC, hydrogen SC
5        Feasible MC                  random MC, SC
6        Feasible MC                  random MC, hydrogen SC
7        Feasible MC, complexity      random MC, SC

[Structural formulas of target polymers TP1-TP9 are not reproduced here.]

Target   Part   Col. 3      Col. 4      Col. 5      Col. 6      Col. 7
TP1      (a)    12% (3)     60% (15)    28% (7)     60% (15)    64% (16)
         (b)    184         300         233         240         428
         (c)    282         192         281         213         166
TP2      (a)    36% (9)     48% (12)    40% (10)    48% (12)    48% (12)
         (b)    411         400         209         522         412
         (c)    6           7           7           6           10
TP3      (a)    0% (0)      4% (1)      8% (2)      12% (3)     0% (0)
         (b)    -           293         640         193         -
         (c)    163         91          161         74          109
TP4      (a)    0% (0)      0% (0)      0% (0)      0% (0)      0% (0)
         (b)    -           -           -           -           -
         (c)    861         564         910         589         570
TP5      (a)    32% (8)     56% (14)    48% (12)    48% (12)    92% (23)
         (b)    400         205         317         232         420
         (c)    175         136         197         142         99
TP6      (a)    8% (2)      68% (17)    4% (1)      32% (8)     16% (4)
         (b)    548         405         529         632         528
         (c)    199         146         314         168         158
TP7      (a)    100% (25)   100% (25)   100% (25)   100% (25)   100% (25)
         (b)    61          61          58          64          85
         (c)    217         188         214         198         163
TP8      (a)    68% (17)    68% (17)    76% (19)    88% (22)    96% (24)
         (b)    210         88          147         109         81
         (c)    162         132         158         161         125
TP9      (a)    8% (2)      4% (1)      4% (1)      4% (1)      0% (0)
         (b)    382         132         513         868         -
         (c)    144         69          174         70          46

(a) target polymer success rate, with the number of times the target was found out of 25 GA runs in parentheses; (b) average generation number for locating the target polymer (- where the target was not located); (c) number of distinct polymers with fitness ≥ 0.99 (0.985 for TP2); MC = mainchain, SC = sidechain.

The results also suggest that the complexity of the initial polymer population played a role in the success rate of the genetic design. For example, the standard genetic design, in general, gave better results when the initial population sidechains were seeded with hydrogen groups (column 3, part (a) vs. column 4, part (a)). Large improvements were seen for target polymer 1 (12% to 60%) and for target polymer 6 (8% to 68%). Similar results were obtained for the knowledge-augmented genetic design that penalized unstable mainchain structures (column 5, part (a) vs. column 6, part (a)). The best improvements were those for target polymer 1 (28% to 60%) and for target polymer 6 (4% to 32%).

Part (c) of Table 3 lists the number of near-optimal or high-fitness solutions that were found for each target. This ability of the genetic design system to find many diverse alternative solutions with properties very close to the desired target properties is one of the most appealing features of the system. The high-fitness threshold was 0.99 for all design targets except polymer 2, for which it was 0.985, because the genetic design was unable to find alternative solutions with a fitness value greater than 0.99 for this polymer. It should be noted that while the genetic design did not find the exact target for polymer 4, it did locate between roughly 560 and 910 alternative near-optimal solutions, depending on the scenario.

13.4.1 Near-optimal solutions

Table 4 presents two of the numerous nearly optimal alternatives for target polymer 4 for each of the scenarios 1-3. As one can see, the alternative solutions were very close to the target properties and had fitness values exceeding 0.99. The average absolute error ranged from 0.25% to slightly over 1.0% of the desired property values. The solutions varied according to the search type. For example, case 1 (basic genetic design) obtained two infeasible polymers. The first used a combination of -O- and >C=O groups instead of the single -O-C=O- group, and the second contained an -O-O-O- group combination, which was unstable. Using the correct -O-C=O- group reduced the fitness to 0.976 and increased the average absolute error to 2.04%. Case 2 produced feasible mainchain structures, but these were generally more complex than those in case 3, which also considered molecular complexity. The number of near-optimal solutions was approximately the same for all genetic design types.

Table 5 presents corresponding results for target polymer 3. For this target, as in the case of target polymer 4, all alternative solutions had very high fitness values. Furthermore, these alternative solutions were structurally fairly similar to the actual target. It can be easily appreciated that this ability of the genetic design system to deliver a number of nearly optimal solutions structurally similar to the target is of immense practical importance. In several cases, one of the near-optimal candidates could easily turn out to be an attractive and feasible option for further consideration.

13.5 PARAMETRIC SENSITIVITY AND ROBUSTNESS ANALYSES FOR GAs

The performance of GA-based strategies is intimately tied to the different parameters employed in the algorithm. These parameters control the various aspects of the algorithm and hence directly govern the outcome of the search. Whether an optimal parameter setting exists, and what it is, can be determined only by experimentation. The results of the GA design system on the case studies, though encouraging, varied widely in terms of success rate as well as the quality of the final solutions obtained. This indicated that a detailed parametric sensitivity analysis was needed to improve performance. Such an analysis would help to establish whether an optimal setting could be obtained, independent of the nature of the target structure or design problem. In their previous work, Sundaram and Venkatasubramanian carried out such a parametric sensitivity study in an effort to systematically determine optimal parameter settings [9]. Their investigation also involved a characterization of the search space in order to identify strategies that would allow the GA to exploit the underlying structure of the space. The key results from their work are mentioned below.


Table 4: Near-optimal solutions for target polymer 4

[Structural formulas of the target polymer and of the near-optimal designs are not reproduced here.]

Polymer design                                            Property errors                    % error (a)   Fitness
Target polymer: TP4                                       {0; 0; 0; 0; 0}                    0%            1.0
Case 1: standard GD (b)
  Design 1                                                {-2.2; -0.5; 0.4; 0.4; -2}         0.74%         0.995
  Design 2                                                {1.6; 2.2; -0.8; -0.2; 0.9}        1.18%         0.991
Case 2: knowledge-augmented GD, stability
  Design 1                                                {0.04; 0.09; -0.4; 0.09; 0.7}      1.10%         0.999
  Design 2                                                {0.4; 1.9; 0.85; 0.14; -2.2}       1.10%         0.991
Case 3: knowledge-augmented GD, stability & complexity
  Design 1                                                {-0.1; 0.6; 0.1; 0.08; -0.04}      0.21%         0.999
  Design 2                                                {0.4; -1.0; 0.02; 1.8; -0.9}       0.83%         0.999

(a) % error is the average absolute error (%) over {ρ; Tg; α; Cp; K}. (b) GD = genetic design.

The study clearly highlighted the absence of a single optimal setting for the parameters examined. In fact, a parameter setting found to work very well for a particular target was found to be non-optimal for a different target. The results implied that an optimal tuning of parameters could be done only on a run-to-run basis. The target-specific nature of the optimal parameter settings exposed an important aspect of the algorithm: the nature of the search space critically influenced the mechanics of the GA. The search-space characterization study illustrated that the structure of the fitness landscape was drastically altered by the target property settings. While in some cases the landscape was amenable to search using convexity-based algorithms, in other cases it remained rather flat but reasonably correlated for small changes. The most important insight provided by the study was that the breadth as well as the depth of the sampling of chromosomes is crucial to the performance of the GA. Stated differently, the diversity of chromosomes sampled during the search is important not only in terms of the variety of the samples, measured by their distances in the search space, but also in terms of the number of samples needed at a given distance of separation. This becomes even more pronounced under non-binary genetic encoding.

Table 5: Near-optimal solutions for target polymer 3

[Structural formulas of the target polymer and of the near-optimal designs are not reproduced here.]

Polymer design            Property errors                     % error (a)   Fitness
Target polymer: TP3       {0; 0; 0; 0; 0}                     0%            1.0
Near-optimal solutions
  Design 1                {0.58; 0.22; 0.89; -1.3; 0.09}      0.62%         0.997
  Design 2                {-0.95; 0.3; 0.68; -0.4; -1.5}      0.76%         0.996
  Design 3                {-0.61; 0.56; 1.2; -0.09; 2.1}      0.92%         0.993
  Design 4                {-1.9; 0.34; -0.5; -2; -0.5}        1.05%         0.992

(a) % error is the average absolute error (%) over {ρ; Tg; α; Cp; K}.
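The "% error" column of Table 5 can be checked against the bracketed property errors. The sketch below assumes that the column is the plain arithmetic mean of the absolute values of the five entries (for ρ, Tg, α, Cp and K), which is how the footnote reads. The function name is illustrative.

```python
def average_absolute_error(errors):
    """Mean of the absolute percentage errors over all target properties."""
    return sum(abs(e) for e in errors) / len(errors)


# (property errors, reported average absolute error in %) as listed in Table 5
TABLE_5_ROWS = [
    ((0.58, 0.22, 0.89, -1.3, 0.09), 0.62),
    ((-0.95, 0.3, 0.68, -0.4, -1.5), 0.76),
    ((-0.61, 0.56, 1.2, -0.09, 2.1), 0.92),
    ((-1.9, 0.34, -0.5, -2.0, -0.5), 1.05),
]

for errors, reported in TABLE_5_ROWS:
    computed = average_absolute_error(errors)
    print(f"computed {computed:.3f}%  reported {reported:.2f}%")
```

Under this assumption the computed values agree with the reported ones to within rounding of the tabulated entries.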

In addition to the issue of parametric sensitivity, another important concern relates to the robustness of the genetic search method, and indeed of any design system, to uncertainty in the forward prediction model used for fitness evaluation. Every forward model has some level of error associated with it. Depending upon the type and complexity of the property or performance measure at hand, the predictions of a model may be as much as 10-15% off the true values. While such a high degree of error may not be present in predictive models for simpler properties such as density, there would surely be some error. The presence of error may be viewed as uncertainty in the forward predictions. The practical utility of a design system would then be related to its performance under such uncertainty. In a recent work, Patkar and Venkatasubramanian [10] studied the robustness of genetic algorithms to model uncertainty in molecular design. The study was carried out using the large polymer design case study. The results were highly encouraging and indicated an overall robust performance of the GA-based design system. For the target polymers considered, the system remained successful even with errors as high as 10% in the forward model.
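One simple way to emulate forward-model uncertainty in such a study is to perturb every predicted property by a bounded random relative error before the fitness is evaluated. The sketch below is only an illustration of that idea under a uniform-error assumption. It is not the procedure of reference [10], and the function names and the example model are hypothetical.

```python
import random


def perturb_prediction(value, max_rel_error, rng):
    """Scale a prediction by a random relative error in [-max, +max]."""
    return value * (1.0 + rng.uniform(-max_rel_error, max_rel_error))


def noisy_forward_model(model, max_rel_error, rng):
    """Wrap a forward model (design -> list of property values) with noise."""
    def wrapped(design):
        return [perturb_prediction(v, max_rel_error, rng) for v in model(design)]
    return wrapped
```

Wrapping the forward model, rather than the fitness function, keeps the perturbation at the level where the uncertainty is assumed to arise.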


13.6 CONCLUSIONS

The performance of a GA-based approach for large-scale molecular design was investigated with the help of a large polymer design case study. The total number of solution candidates in the present problem was about 100 million times larger than in the example discussed in chapter 5. It was found that, despite the tremendous increase in the search space size and the complex nonlinear group interactions, the genetic design was generally able to find the target molecules. Furthermore, it was also able to provide a diverse collection of design alternatives that nearly satisfy the property constraints. However, the algorithm achieved a much lower success rate and converged much more slowly than on the smaller problem.

The versatility of the genetic search methodology was illustrated in terms of its easy extension to include higher-level chemical knowledge. The objective of incorporating such knowledge was to ensure that more realistic, stable, and less complex solutions were obtained from the search. The results indicated that the inclusion of knowledge not only eliminated the creation of chemically infeasible structures as expected, but also improved the overall efficiency of the genetic design. In other words, not surprisingly, the search turned out to be more intelligent than in the absence of additional knowledge.

It was evident from the case studies that the genetic design system was extremely proficient at rapidly locating favorable regions in the design space. It was, however, less effective at performing very localized searches. This was seen in many design scenarios where the optimal design could be reached by three or four genetic operations but took the algorithm several hundred generations to realize the target. This strongly indicated that tuning the parameters could significantly improve performance. However parametric sensitivity studies indicated the absence of a single optimal parameter setting. The best settings changed from one target to another and could be determined only by experimentation.

The issue of the performance of GAs under forward model uncertainty was briefly addressed. Results from a recent study are encouraging and indicate significant robustness on the part of the genetic design system.


In conclusion, the problem independent, efficient nature of the versatile genetic approach and the ease with which chemical, biological, design or process knowledge and constraints can be incorporated make the genetic design framework very appealing for CAMD and worthy of further investigation for large-scale molecular design problems.

13.7 LIST OF SYMBOLS AND ABBREVIATIONS

F        fitness value
Fcrit    fitness threshold
α        decay rate for Gaussian fitness function
β        penalty scaling factor for complexity
γ        complexity gain
η        penalty coefficient for modified fitness function
φ_i      penalty related to the ith constraint
CAMD     Computer-Aided Molecular Design
GA(s)    Genetic Algorithm(s)
PET      Polyethylene terephthalate
PVP      Poly(vinylidene propylene) copolymer
PC       Polycarbonate of bisphenol-A
MC       mainchain
SC       sidechain


REFERENCES

1. V. Venkatasubramanian, K. Chan and J.M. Caruthers, Comput. Chem. Eng., 18 (1994) 833-844.

2. V. Venkatasubramanian, K. Chan and J.M. Caruthers, J. Chem. Inf. Comput. Sci., 35 (1995) 188-195.

3. D.W. van Krevelen, Properties of Polymers; their Correlation with Chemical Structure; their Numerical Estimation and Prediction from Additive Group Contribution, 3rd Ed., Elsevier, Amsterdam, The Netherlands, 1990.

4. D. Barton and W.D. Ollis (Eds.), Comprehensive Organic Chemistry: The Synthesis and Reaction of Organic Compounds, First Edition, Pergamon Press, New York, 1979.

5. E.A. Brignole, S. Bottini and R. Gani, Fluid Phase Equil., 29 (1986) 125-132.

6. K.G. Joback and G. Stephanopoulos, FOCAPD '89, Snowmass, CO, 1989.

7. S. Macchietto, O. Odele and O. Omatsone, Chem. Eng. Res. Des., 68 (5) (1990) 429-433.

8. R. Gani and E.A. Brignole, Fluid Phase Equil., 13 (1983) 331-340.

9. A. Sundaram and V. Venkatasubramanian, J. Chem. Inf. Comput. Sci., 38 (1998) 1177-1191.

10. P.R. Patkar and V. Venkatasubramanian, AIChE J. (submitted for publication, 2002).

Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.

Chapter 14: Case Study in Identification of Multistep Reaction Stoichiometries

A. Buxton, A. Hugo, A.G. Livingston & E.N. Pistikopoulos

14.1 INTRODUCTION

In this chapter, the systematic procedure for the rapid identification of environmentally benign alternative multi-step stoichiometries, as described in Chapter 7, is applied to a case study: the production of acetic acid. Acetic acid is one of the most important aliphatic intermediate compounds. Several of its esters are important in artificial silk manufacture and are used as solvents for resins and paints, while its inorganic salts are used in the dye and clothing industries and in medicine. The scale of production of this molecule makes it an interesting example from the environmental point of view. The background and chemical routes for this example were adapted from Weissermel and Arpe (1993).

14.2 PROBLEM FORMULATION

The problem addressed here may be stated as follows:

Given a desired organic product

Identify a set of candidate multi-step organic reaction stoichiometries for the production of the desired product which are both economically and environmentally promising.

This requires a three step procedure: (i) selection of co-material groups, (ii) determination of a set of candidate co-materials, and (iii) identification of a set of promising candidate multi-step stoichiometries.

The use of such a structured, stepwise procedure reduces the multi-step stoichiometry identification problem to a manageable size. The key to the procedure is the introduction of co-material design (steps (i) and (ii)). With the product and stoichiometric co-materials known, the identification of feasible reaction stoichiometries is no longer an open-ended problem. The steps of the procedure are described in the following sections.

14.3 METHODOLOGY

As described in Chapter 7, the first step in the methodology is the application of a new group-based co-material enumeration algorithm. By introducing material design principles, through structural and chemical feasibility constraints, a manageable set of raw materials and co-products can be generated. Next, stoichiometries are extracted from the co-material set using a two-step optimisation procedure, including whole number stoichiometric coefficient constraints, carbon structure constraints and case-specific constraints based on chemical knowledge. Thermodynamic, economic and environmental impact criteria are employed in the evaluation of feasible stoichiometries, with aspects of the Methodology for Environmental Impact Minimisation (MEIM) (Pistikopoulos et al., 1994) providing the framework for the environmental evaluation of alternatives. The particular specifications used in the case study for each of these steps follow.

GROUP PRE-SELECTION

There are five established routes to acetic acid, which are shown in Figure 1. As before, for simplicity, group pre-selection was restricted to identifying the simplest set of UNIFAC groups necessary to represent the product and the co-materials involved in these stoichiometries. As a further simplification, the chemistry-specific intermediates peracetic acid and 2-acetoxybutane were not considered as part of group pre-selection, since it is unlikely that they would be produced and consumed in different stoichiometries which lead directly to the desired product. Accordingly, the following thirteen groups were selected: CH3-, -CH2-, -CHO, -CO2H, CH3COO-, -CH=CH-, CH3CO-, HCOO-, CH2=CH-, -OH, H2O, CH3OH, HCOOH. The latter three groups are complete molecules selected from class zero in Constantinou et al. (1996); no category two groups feature in this example.

CO-MATERIAL DESIGN

Since the established chemistries involve only unbranched acyclic molecules (disregarding 2-acetoxybutane), the co-material enumeration problem was solved for such molecules only, with the following additional structural restrictions based on the established co-materials: (i) an upper limit of four groups per molecule is imposed, and (ii) only one oxygen-containing group is allowed per molecule, since more complex molecules than this are unlikely raw materials and the common industrial by-products are simpler than the product (mostly CO2 and HCO2H).
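The two structural restrictions can be expressed as a simple filter over candidate group lists. The sketch below is illustrative only: it represents a molecule as a plain list of group labels, takes the oxygen-containing groups from the thirteen pre-selected groups, and does not attempt the valence or connectivity checks that a real enumeration algorithm would need. The names `OXYGEN_GROUPS` and `satisfies_structural_limits` are assumptions made here.

```python
# Oxygen-containing groups among the pre-selected UNIFAC groups
# (whole-molecule groups such as H2O, CH3OH and HCOOH are left out).
OXYGEN_GROUPS = {"-CHO", "-CO2H", "CH3COO-", "CH3CO-", "HCOO-", "-OH"}


def satisfies_structural_limits(groups, max_groups=4, max_oxygen_groups=1):
    """Check restrictions (i) and (ii) for a molecule given as a group list."""
    n_oxygen = sum(1 for g in groups if g in OXYGEN_GROUPS)
    return len(groups) <= max_groups and n_oxygen <= max_oxygen_groups
```

For example, the group list for ethanol (CH3-, -CH2-, -OH) passes both limits, whereas a five-group chain or a molecule carrying two oxygen-containing groups is rejected.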


Oxidation of Acetaldehyde

CH3CHO + O2 → CH3CO-O-OH
(acetaldehyde → peracetic acid)

CH3CO-O-OH + CH3CHO → 2 CH3CO2H
(peracetic acid + acetaldehyde → acetic acid)

Operated by: UCC (USA), Daicel (Japan) and British Celanese (UK)

Oxidation of Alkanes (n-Butane)

CH3(CH2)2CH3 + 2.5 O2 → 2 CH3CO2H + H2O
(n-butane → acetic acid)

Operated by: Hoechst Celanese, Hüls and UCC (USA)

Oxidation of Alkenes (Butenes)

CH3CH2CH=CH2 or CH3CH=CHCH3 + CH3CO2H → CH3CH2CH(O2CCH3)CH3
(1-butene or 2-butene → 2-acetoxybutane)

CH3CH2CH(O2CCH3)CH3 + 2 O2 → 3 CH3CO2H
(2-acetoxybutane → acetic acid)

Operated by: Bayer and Hüls

Carbonylation of Methanol

CH3OH + CO → CH3CO2H

Operated by: BASF and Monsanto

Isomerisation of Methyl Formate

CH3OCHO → CH3CO2H

Not Yet Commercialised

Figure 1: Acetic Acid Production Routes
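Each route in Figure 1 can be checked for atom balance, which is the basic requirement behind the stoichiometric coefficient constraints. The sketch below is a minimal illustration: the atom counts are entered by hand, reactants carry negative coefficients and products positive ones, and the dictionary keys and function name are choices made here rather than part of the methodology.

```python
from fractions import Fraction

# Atom counts per molecule, entered by hand for the species in Figure 1.
ATOMS = {
    "CH3CHO": {"C": 2, "H": 4, "O": 1},      # acetaldehyde
    "CH3CO3H": {"C": 2, "H": 4, "O": 3},     # peracetic acid
    "CH3CO2H": {"C": 2, "H": 4, "O": 2},     # acetic acid
    "C4H10": {"C": 4, "H": 10},              # n-butane
    "CH3OH": {"C": 1, "H": 4, "O": 1},       # methanol
    "HCOOCH3": {"C": 2, "H": 4, "O": 2},     # methyl formate
    "CO": {"C": 1, "O": 1},
    "O2": {"O": 2},
    "H2O": {"H": 2, "O": 1},
}


def is_balanced(stoich):
    """True if every element sums to zero over the whole stoichiometry."""
    totals = {}
    for species, coeff in stoich.items():
        for atom, count in ATOMS[species].items():
            totals[atom] = totals.get(atom, 0) + Fraction(coeff) * count
    return all(total == 0 for total in totals.values())


# n-butane oxidation as written in Figure 1 (2.5 O2 kept as an exact fraction)
butane_route = {"C4H10": -1, "O2": Fraction(-5, 2), "CH3CO2H": 2, "H2O": 1}
print(is_balanced(butane_route))
```

Using exact fractions avoids rounding issues with the 2.5 O2 coefficient. Multiplying the route through by two gives the whole-number form that the optimisation formulation works with.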

ROLE SPECIFICATION CONSTRAINTS

In line with the industrial routes, stoichiometries of up to two steps in length were allowed, with a maximum of four species permitted in any step. Table 1 shows the knowledge-based role specification constraints employed in the acetic acid example where, as before, R denotes reactant only, P denotes the final product, C denotes product or co-product, N denotes the exclusion of a species from a system and a blank space denotes no restriction. These constraints were again developed specifically for two-step stoichiometries according to the following arguments, based on chemical knowledge and the existing industrial chemistries.


Table 1: Role Specification Constraints - Acetic Acid Example

Species            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
System 0           R C R R R C C R P N R R N R N N R N R N N R R N R N N N
Systems 1A & 1B    i C R R R C CC C R N R C N C R R C N C R N C C C C C

• Alcohols (species 1, 13 and 18) oxidise to aldehydes and then to carboxylic acids in two steps, and so are included as reactants only in systems 1A and 1B and excluded altogether from system zero. The exception is methanol (species 1), which is allowed as a reactant in system zero for carbonylation directly to acetic acid and is unrestricted in systems 1A and 1B.

• Accordingly, aldehydes (species 8, 14 and 19) are included as products or co-products only in systems 1A and 1B and as reactants only in system zero.

• Unsaturated molecules (species 11, 17 and 22) may be reactants only in all systems; their formation is not considered.

• Alkanes (species 12 and 23) may be oxidised directly to acids; they are therefore included as raw materials only in system zero and excluded from systems 1A and 1B.

• Higher carboxylic acids (species 15 and 20) are unlikely raw materials and undesirable co-products for a promising stoichiometry; they are therefore excluded altogether.

• Formates (species 24, 25 and 26) and acetates (species 10, 16 and 21) are esters of formic and acetic acids respectively. They are therefore unlikely raw materials and, due to the conditions necessary for esterification (concentrated sulphuric acid), they are also unlikely co-products. They are therefore excluded from system zero (except methyl formate, species 25, for isomerisation) and included only as products or co-products in systems 1A and 1B.

• Formic acid (species 7) is included as a co-product in system zero, since it is a recognised industrial by-product, and is included as a reactant only in systems 1A and 1B to allow the generation of formates.

• Ketones (species 27 and 28) are produced by oxidising secondary alcohols. No such alcohols are included here, so these species are excluded from system zero and included only as products or co-products in systems 1A and 1B.

• H2O and CO2 (species 2 and 6) are included as co-products only in all systems, in line with the industrial chemistries.

• CO, O2 and H2 are included as reactants only in all systems.
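The role codes lend themselves to a direct feasibility check on a candidate stoichiometry. The sketch below assumes the System 0 row as read from Table 1 (species 1 to 28 in order) and the sign convention that reactants have negative and products positive coefficients. The function name and the dictionary representation are illustrative choices, not part of the original formulation.

```python
# Role codes for system zero, species 1..28, as read from Table 1.
# R = reactant only, C = product or co-product, P = final product, N = excluded.
SYSTEM_0_ROLES = dict(zip(range(1, 29), "RCRRRCCRPNRRNRNNRNRNNRRNRNNN"))


def respects_roles(stoich, roles):
    """Check a stoichiometry {species: coefficient} against role codes."""
    for species, coeff in stoich.items():
        if coeff == 0:
            continue
        role = roles.get(species, "")   # empty string means no restriction
        if role == "N":
            return False                # species excluded from this system
        if role == "R" and coeff > 0:
            return False                # reactant-only species was produced
        if role in ("C", "P") and coeff < 0:
            return False                # product-only species was consumed
    return True


# Carbonylation of methanol: CH3OH (1) + CO (3) -> CH3CO2H (9)
print(respects_roles({1: -1, 3: -1, 9: 1}, SYSTEM_0_ROLES))
```

A direct oxidation of ethanol (species 13) to acetic acid would be rejected in system zero, because ethanol carries the code N there.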

CHEMISTRY CONSTRAINTS

Knowledge-based chemistry constraints were employed using the binary product and reactant flags, i_s and ii_s respectively, found in the whole number stoichiometry constraints as defined in Chapter 7. It is worth recalling that the binary variable ii_s takes the value zero if species s is a product and unity if species s is a reactant, while i_s takes the value zero when s is a reactant and unity when s is a product.

• alcohols, alkenes, alkanes and aldehydes may not react with each other

$$ii_{1} + ii_{9} + ii_{13} + ii_{18} + ii_{11} + ii_{17} + ii_{22} + ii_{12} + ii_{23} + ii_{8} + ii_{14} + ii_{19} \le 1 \qquad (1)$$

• carbonylation (reaction with carbon monoxide) is restricted to alcohols and formates

$$ii_{3} - (ii_{1} + ii_{13} + ii_{18} + ii_{24} + ii_{27} + ii_{28} + ii_{5}) \le 0 \qquad (2)$$

• formates must either react with oxygen or carbon monoxide or undergo isomerisation

$$ii_{24} + ii_{27} + ii_{28} - ii_{3} - ii_{4} \le 2 - \sum_{s} ii_{s} \qquad (3)$$

• formates may be produced only by esterification of formic acid with the appropriate alcohol

$$2\,i_{24} - ii_{7} - ii_{18} \le 0 \qquad (4)$$

$$2\,i_{27} - ii_{7} - ii_{1} \le 0 \qquad (5)$$

$$2\,i_{28} - ii_{7} - ii_{13} \le 0 \qquad (6)$$

• aldehydes may only be produced by oxidation of the appropriate alcohols or oxidation or hydration of the appropriate unsaturated compounds

$$2\,i_{8} - ii_{13} - ii_{11} - ii_{17} - ii_{22} - ii_{2} - ii_{4} \le 0 \qquad (7)$$

$$2\,i_{14} - ii_{18} - ii_{11} - ii_{22} - ii_{2} - ii_{4} \le 0 \qquad (8)$$

$$2\,i_{19} - ii_{17} - ii_{4} \le 0 \qquad (9)$$

• methanol may only be produced from synthesis gas

$$2\,i_{1} - ii_{5} - ii_{3} \le 0 \qquad (10)$$
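Constraints (4) to (10) share one pattern: twice the product flag of a species, minus the reactant flags of the species allowed to form it, must not exceed zero. The sketch below evaluates that pattern for a given assignment of flags. It is a checking aid only, not the optimisation formulation itself. The function name and the dictionary representation of the flags are assumptions, and the species indices are those printed in the equations.

```python
def production_constraint_ok(product, allowed_reactants, i, ii):
    """Evaluate 2*i[product] - sum(ii[r] for r in allowed_reactants) <= 0.

    i  -- dict of binary product flags (1 if the species is a product)
    ii -- dict of binary reactant flags (1 if the species is a reactant)
    Species missing from a dict are treated as having flag 0.
    """
    lhs = 2 * i.get(product, 0) - sum(ii.get(r, 0) for r in allowed_reactants)
    return lhs <= 0


# Constraint (10): methanol (1) only from synthesis gas, H2 (5) and CO (3)
print(production_constraint_ok(1, (5, 3), i={1: 1}, ii={5: 1, 3: 1}))
```

For constraints with exactly two reactant flags, such as (4) to (6), (9) and (10), the factor of two means that both reactants must be present whenever the product flag is set. When the species is not a product the constraint is inactive.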


In addition to these constraints, to prevent the formation of carbon-carbon bonds, the carbon structure constraints given in Chapter 7 were employed in a slightly modified form to allow the carbonylation of methanol. Finally, a production rate (crvp) lower bound of 2.5 kmol/hr and an allowable reactor temperature range of 300-800K were imposed.

14.4 RESULTS

CO-MATERIAL DESIGN

Applying the co-material design procedure, twenty-one co-materials were constructed. Methanol, water and formic acid are included as additional molecules from class zero in Constantinou et al. (1996) according to the established routes. Carbon monoxide, oxygen, hydrogen and carbon dioxide are included as further additional molecules. All twenty-eight co-materials are shown in Table 2.

MULTI-STEP STOICHIOMETRY IDENTIFICATION RESULTS

The solutions of the stoichiometry identification program are again presented in the form of a table of stoichiometric coefficients in Table 3, where blank spaces indicate zero coefficients and the species are numbered as above.

Table 2: Co-Material Design Results - Acetic Acid Example

 1) CH3OH             Methanol            15) CH3CH2CO2H          Propanoic Acid
 2) H2O               Water               16) CH3CH2O2CCH3        Ethyl Acetate
 3) CO                Carbon Monoxide     17) CH3CH=CHCH3         2-Butene
 4) O2                Oxygen              18) CH3(CH2)2OH         Propanol
 5) H2                Hydrogen            19) CH3(CH2)2CHO        Butanal
 6) CO2               Carbon Dioxide      20) CH3(CH2)2CO2H       Butanoic Acid
 7) HCO2H             Formic Acid         21) CH3(CH2)2OOCCH3     Propyl Acetate
 8) CH3CHO            Acetaldehyde        22) CH3CH2CH=CHCH3      2-Pentene
 9) CH3CO2H           Acetic Acid         23) CH3(CH2)2CH3        n-Butane
10) CH3OOCCH3         Methyl Acetate      24) CH3(CH2)2OCHO       Propyl Formate
11) CH3CH=CH2         Propene             25) CH3OCHO             Methyl Formate
12) CH3CH2CH3         Propane             26) CH3CH2OCHO          Ethyl Formate
13) CH3CH2OH          Ethanol             27) CH3COCH3            Propanone
14) CH3CH2CHO         Propanal            28) CH3CH2COCH3         Butanone

System zero produced nine candidate stoichiometries that satisfy all constraints, in which materials 1, 3, 4, 8, 11, 14, 17, 19, 22, 23 and 27 appear as first generation precursor reactants. Due to the large number of role specification and chemistry constraints, the imposition of carbon structure constraints, and the simplicity of the species, systems 1A and 1B produced only seven further stoichiometries for the production of species 1, 8, 14 and 19. All stoichiometries

Table 3 [rotated table of stoichiometric coefficients; the coefficient columns for species 1-28 are not reproduced here, and only the summary columns are listed]

Index   N_sp   T (K)   Production rate (kmol/hr)   Profit ($/mol)   CTAM (tn air/mol)

System 0 - Producing Species 9
A       2      300     10.00                       -0.0125          1.68
B       3      300     10.00                       0.0398           0.91
C       3      300     20.00                       0.0359           13.59
D       3      300     40.00                       0.0181           7.17
E       3      300     20.00                       0.0083           6.27
F       4      300     20.00                       0.0515           21.69
G       4      300     40.00                       0.0440           16.43
H       4      300     20.00                       0.0463           22.47
I       4      300     10.00                       0.0787           31.88

System 1 - Producing Species 1
J       3      300     9.84                        -0.0024          3.96

System 1 - Producing Species 19
K       3      300     20.00                       0.0357           3.53

System 1 - Producing Species 8
L       3      300     20.00                       0.0276           2.40
M       4      680     2.50                        0.0484           3163.81
N       4      435     5.00                        -0.0025          377.02
O       4      300     20.00                       0.0703           20.69

System 1 - Producing Species 14
P       4      680     2.50                        0.0484           3163.81

molecule is different. This eventuality was not accounted for in the stoichiometry identification formulation. Since the integer cuts are written to prevent any stoichiometry from occurring more than once, no stoichiometries producing species 14 were found and stoichiometry P was added manually after solving the problem. In principle, the integer cuts could be modified to avoid this problem by including information to identify the target molecule, so that only repeated stoichiometries used to produce a previously targeted molecule are excluded.

All five established chemistries shown in Figure 1 were reproduced: stoichiometry A representing the isomerisation of methyl formate, stoichiometry B representing the carbonylation of methanol, stoichiometry G representing the oxidation of n-butane, and stoichiometries E and G representing the oxidation of acetaldehyde and butenes respectively, if in somewhat reduced form (i.e. without the inclusion of intermediates).

Table 4 shows the total profits and impacts for the individual solutions combined into multi-step stoichiometries. As before, the profits reflect only the values of the products minus the values of the reactants, assuming that stoichiometric co-products are sold at their market value. Once again, raw materials are assumed to be input waste free. As before, stoichiometries with poor conversion are not penalised.

Table 4: Total Profits and Impacts

Index   Total Profit $/mol   Total CTAM tn air/mol
A       -0.0125              1.68
BJ       0.0373              4.88
C        0.0359              13.59
DK       0.0359              8.94
EL       0.0359              8.67
EM       0.0568              3170.08
EN       0.0058              383.29
EO       0.0787              26.96
F        0.0515              21.69
G        0.0440              16.43
HP       0.0947              3186.28
I        0.0787              31.88

14.5 CONCLUSIONS

In the example presented here, there is a noticeable variation in both total profit and impact figures, so that the most promising solutions can be more easily identified. Clearly, stoichiometries involving steps M, N and P can justifiably be eliminated from further consideration on impact grounds, these steps being penalised in impact terms by high reactor temperature, and stoichiometry A can justifiably be eliminated on economic grounds.

Of the remaining eight stoichiometries, the carbonylation of methanol (step B), with the addition of step J to produce methanol, exhibits the lowest impact of all. Although the profit of stoichiometry BJ is only half that of the highest, its impact is so much lower that it represents the best compromise solution. Once again, this clearly illustrates the advantages of considering multi-step production routes, since step J has a negative profit. Stoichiometries C, DK and EL all exhibit both higher impacts and lower profits than BJ, and while stoichiometries EO, F, G and I exhibit higher profits, their impacts rise in parallel with these profits.

Thus, stoichiometry BJ is most worthy of further investigation, with stoichiometries EL, DK, C, G, F, EO and I representing progressively less promising alternatives. These results highlight how the technique can assist the identification of a small number of alternative stoichiometries which are promising both in terms of economics and environmental impact. Moreover, the application has shown that developing multi-step stoichiometries directly can lead to the acceptance of alternatives which would be rejected as single step syntheses.
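The trade-off between profit and impact discussed above can be explored with a simple dominance check. The sketch below is only an illustration and not the procedure used in this chapter: it takes the totals of Table 4 and flags the routes that no other route beats on both criteria at once. The function names are hypothetical. A dominance check cannot by itself reject a route such as A or HP, which are extreme on one criterion; eliminating those still requires the judgement on economic and impact grounds described in the text.

```python
# Totals from Table 4: (total profit in $/mol, total CTAM in tn air/mol).
TABLE_4 = {
    "A": (-0.0125, 1.68), "BJ": (0.0373, 4.88), "C": (0.0359, 13.59),
    "DK": (0.0359, 8.94), "EL": (0.0359, 8.67), "EM": (0.0568, 3170.08),
    "EN": (0.0058, 383.29), "EO": (0.0787, 26.96), "F": (0.0515, 21.69),
    "G": (0.0440, 16.43), "HP": (0.0947, 3186.28), "I": (0.0787, 31.88),
}


def dominates(a, b):
    """True if route a is at least as good as b on both criteria and better on one."""
    (profit_a, impact_a), (profit_b, impact_b) = a, b
    no_worse = profit_a >= profit_b and impact_a <= impact_b
    better = profit_a > profit_b or impact_a < impact_b
    return no_worse and better


def non_dominated(routes):
    """Names of routes that are not dominated by any other route."""
    return sorted(
        name
        for name, values in routes.items()
        if not any(
            dominates(other_values, values)
            for other, other_values in routes.items()
            if other != name
        )
    )


front = non_dominated(TABLE_4)
print(front)
```

In this check C, DK and EL are all dominated by BJ, which agrees with the statement that they exhibit both higher impacts and lower profits than BJ.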


Constantinou, L., K. Bagherpour, R. Gani, J.A. Klein and D.T. Wu. Computer Aided Product Design: Problem Formulations, Methodology and Applications. Computers chem. Engng 20(6), 685-703 (1996)

Pistikopoulos, E.N., S.K. Stefanis and A.G. Livingston. A Methodology for Minimum Environmental Impact Analysis. AIChE Symposium Series, Volume on Pollution Prevention through Process and Product Modifications 90(303), 139-150 (1994)

Weissermel, K. and H.-J. Arpe. Industrial Organic Chemistry. Second, Revised and Extended Edition. VCH, Weinheim FRG (1993)


Chapter 15: Molecular Design of Fuel Additives

A. Sundaram, V. Venkatasubramanian & J. M. Caruthers

Computer-aided product design for performance usually involves evolving the best combination from existing prediction methods and search schemes to navigate the space of possible solutions. While the use of existing techniques may serve expediency well, it often comes at the cost of prediction inaccuracy and the inability of the proposed designs to meet actual performance requirements. In this paper we describe the development and implementation of an integrated approach that develops accurate prediction models and efficient design strategies using "design-relevant" functional descriptors and their associated structural building blocks. The forward (prediction) models are constructed to be an optimal trade-off between accuracy and robustness under a hybrid first-principles and neural-network framework. The phenomenological component is structured to mirror the product formulation process, wherein smaller units (electrons, atoms, molecular fragments) are hierarchically modified to obtain desired performance contributions in the larger units (molecules and formulations) that contain them. The nonlinear and often uncertain influence of these contributions on the product performance is then built into the model by a correlative/neural-network approach. An evolutionary design/search strategy is used to construct the molecular solutions, guided by performance objectives. The search strategy is customized to retain feasibility and navigate efficiently by using the design-relevant building blocks. The implementation of this CAMD approach and its effectiveness in designing novel and synthetically feasible fuel-additives is demonstrated in this paper.

15.1 INTRODUCTION: PRODUCT DESIGN FOR PERFORMANCE

Product design and development is an important and strategic activity in the chemical and pharmaceutical industry. It is also expensive and time-consuming, often costing millions of dollars and several years in development. The problem of product design involves the identification of formulations that match or closely approximate the desired performance characteristics of the product. This includes characterization of product performance, development of testing and measurement techniques, construction of design candidates and screening for closeness to desired performance levels. Depending upon the domain of application, screening the product candidates might involve the use of methods ranging from predictive models to expensive field-testing. Hence, computer-aided product design, and more specifically CAMD, has been an important area of engineering research over the past several years.

The problem comprises two parts: the so-called forward problem, which is the computation or prediction of product performance measures from the product formulation or molecular structure, and the inverse problem, which is the identification of the appropriate product formulation given desired property requirements [1]. The focus of the case study in this chapter is the fuel-additive, which falls under the category of an engineering material. The focus of the design is on field performance and not on some inherent property of the material that exists totally independent of its interaction with its application environment. The design of engineering materials presents the following questions: (i) What are the performance indicators? (ii) Can they be reliably determined with a fundamental approach for a given design? (iii) What testing is necessary to obtain the performance measure, and what is the quality of the experiments/data? (iv) What are the factors affecting synthesis of proposed designs? The above questions contain in them the keys to constructing the strategy to solve the forward and inverse problems. In this respect, both the forward and the inverse problems have to capture within them the functional aspects of how structure relates to performance. This involves the identification of sub-structural abstractions (building blocks) that translate function into structure through a mechanistic understanding of product performance. The above questions also reflect the uncertainties and unknowns specific to the design problem at hand, and the solution framework must be robust enough to handle them as well.

15.1.1. Design-Relevant Building Blocks

This chapter deals with a case study in the computer aided molecular design of fuel-additives. Fuel-additives are but one example in the ever-growing class of engineering materials. The design problem involves some aspects well addressed by a traditional CAMD approach and some that are not. The idea behind the approach presented here is the identification and modeling of design-relevant building blocks. These structural (and functional) elements tie together the forward and inverse problems. They reflect, in a transparent fashion, the design process undertaken by an expert formulator in the area. The schematic in Figure 1 further explains this idea. The figure shows the forward and inverse approaches for molecular design along with the formulation and testing cycle for the example of a fuel-additive. It is clear from the figure that both the forward and the inverse approaches are bound to the hierarchy of formulation via key building blocks in the designing process. In some cases these building blocks are quite straightforward. For example, in a mixture design problem based on existing compounds, the building blocks are just the decision variables of mixing fractions. However, in a more involved problem such as that of additive design here, the building blocks can be anything from atoms to molecular fragments to synthesized formulations, all of which influence the final performance. More importantly, these building blocks are not determined by either the forward or inverse strategies but by the formulation process and the physical phenomena at play in the interaction between the product and the intended environment of its use. The most accurate forward models may involve the consideration of electronic and atomic descriptors, but the level of control accorded to the formulator (synthesizability constraints) may exist at the molecular level. On the other hand, using only molecular level descriptions might not capture the essentials of the physics that translates structure to property and further into performance. These sub-structures that behave as ideal performance consolidators for the formulation are the design-relevant building blocks of the CAMD problem. The contributions from these building blocks to eventual product performance through a phenomenological model or description are the functional descriptors.

Figure 1: Forward and Inverse Problems in Additive Design: Parallels to Formulation and Testing Cycle

The design-relevant building blocks and their corresponding functional descriptors are identified in Figure 1 for a general fuel-additive design problem. It is noted here that the functional descriptors are holistic and are usually obtained through a first-principles model. In this work, the approach was to mirror the design effort of the formulation chemist as closely as possible and then to incorporate the model of the underlying physical phenomena to guide that computer-aided process. This gives the dual advantage of building in implicit synthesizability constraints and developing accurate and, more importantly, sensitive decision variables for the inverse design process. Once these design-relevant blocks are ascertained, the approach in the forward problem is to de-couple the fundamental and correlative aspects of performance prediction using a hybrid model. The inverse problem is handled by an evolutionary design approach for reasons outlined later in the chapter. Primarily, the evolutionary strategy is flexible enough to completely disengage the inverse problem from the functional characteristics (linear, nonlinear etc.) of the prediction algorithm. It allows the designer the freedom to completely replace the forward model in the future with minimal implementation changes in the evolutionary scheme. When progress in experimentation affords prediction models richer in detail and complexity, or when other performance metrics can be modeled, the design framework is essentially plug and play. The designer can then quickly discern how design decisions are altered by the newly modeled effects and different performance criteria.

15.2 PROBLEM DEFINITION: DESIGN OF FUEL ADDITIVES

Fuel-additives are a class of performance modifiers that are added to gasoline to enhance certain properties and/or to provide additional properties not present in the gasoline. Fuel additives are used as combustion modifiers, anti-oxidants, corrosion inhibitors and deposit control detergents. This effort is focused on the design of fuel-additives that control deposit formation on the intake-valves of the automobile. Figure 2 shows a schematic of the position of the intake-valve and the interacting components in the automobile. It also shows a schematic of the intake-valve and manifold. The intake-valve forms the opening into the combustion chamber. The fuel-injection nozzles spray gasoline directly on the intake-valve. When the valve opens, it draws a mixture of fuel and air into the combustion chamber, where it is burned to supply power to the automobile. Over a period of time, deposits form on the surface of the intake-valve [2,3]. These are the intake-valve deposits (IVD). These deposits have been documented to affect driveability, cold-start efficiency, knock characteristics and emissions [4,5,6].

The US Environmental Protection Agency (EPA) [7] has adopted a standard test to determine the deposit-forming tendency of a fuel package (gasoline + additives) before approving commercialization of the package. The EPA formally adopted the ASTM Standard BMW-IVD test [8] as the performance indicator for the fuel package containing the gasoline and the additive formulation. A 4-cylinder 1985 BMW vehicle is operated over the road for a total of 16,093 km. The daily test cycle consists of 10% city, 20% suburban and 70% highway mileage with an overall average speed of 45 mph. A fuel package must produce an average deposit of less than 100 mg/valve for certification [3]. The function of the class of fuel-additives we are interested in is the prevention of deposition on the valve by aggregating the precursors of deposit formation produced when the fuel flows through the intake valve and holding them in solution. The mechanism of intake-valve deposit formation is a complex one. Additive and fuel chemistry, operating conditions and flow properties of the additive, fuel and oil all play significant and interacting roles in determining the nature and amount of deposit. The test itself is quite expensive, costing about $8000-$10000 per run. In addition, the above-mentioned controlling parameters are not measured with consistent accuracy, if at all. Sometimes alternate tests that are simpler and less expensive are used in lieu of the regulatory benchmark to lower the expense of the design cycle. These factors make the available formulation vs. performance data sets sparse, noisy and at best mildly inconsistent.

Figure 2: Intake-Valve and Combustion Chamber Manifold

At the outset, we are given the chemical make-up, including detailed structures, of the fuel-additive molecules that comprise a given database of engine test results. The engine test results are from the BMW engine test runs or from equivalent engine tests. The engine test results are in the form of an intake-valve deposit measurement after the standard test run. The database also contains the fuels used in the engine and some fuel characteristics such as the boiling curve. Approximate values of the operating conditions such as temperature are also reported in these databases. These databases form the starting point of the additive design case study. The fuel-additives are actually packages consisting of more than one component/molecule. However, without loss of generality it can be assumed that the principal function of deposit prevention/removal is performed by a core group of molecules in the package and that all of them have the same or similar functional features that contribute to their performance. The molecular problem then consists of two parts.

i. Given the structure of the fuel-additives and their dosages in a formulation package, predict to the level of accuracy of the engine or fleet test result, the expected intake-valve deposit on the BMW or an equivalent test.

ii. Given a set of operating conditions that include fuel characteristics and an intake-valve deposit cut-off or similar criteria, determine the molecular structures of the additives that will, or are at least expected to, meet the criteria under those conditions.

15.3 CAMD PROBLEM FORMULATION

Given the expensive testing methods and the apparently intuitive nature of the formulation process, the design of fuel-additives stands to gain tremendously from a computer-aided design process. The aim of this CAMD process is to rationalize both the forward and the inverse approaches to the product design problem (steps (i) & (ii) above). The CAMD formulation in generic terms follows in a straightforward manner from the problem definition. This is depicted with problem-specific detail in Figure 3. The different components of the CAMD formulation are as follows:

15.3.1 Forward Problem

1. Identify the dominant mechanism that determines the performance of the final product, in this case the fuel-additive.

2. Determine the structural components (these could be molecular fragments, atoms or even electronic level components) that are the key players in the performance-determining mechanism. These are the design-relevant building blocks.

3. Determine whether the performance indicators of interest could be directly estimated from a mathematical model of the mechanism at a small or reasonable computational expense.
   a) If such a description exists then this would serve as our forward model.
   b) If not, then an additional statistical model would be required that relates the "outputs" of the mechanism to the performance indicators of interest. Additionally, when several such "outputs" could be extracted from the fundamental model, the optimal set will be determined based on accurate correlation to eventual performance measures.

In cases where no mathematical description is feasible or inexpensive, a purely correlative model could be used to relate performance measures to the functionalities of the design-relevant building blocks.

In this case study, an evolutionary search procedure is employed to locate additive candidates that are expected to satisfy pre-set performance criteria. The solution to the inverse problem is one of combinatorial optimization, and all techniques usually applied to the formulation and solution of this problem are applicable here as well. However, we anticipate a hybrid model with a nonlinear phenomenological component as well as a nonlinear statistical model (including neural networks). This makes for a non-linear combinatorial optimization problem involving the use of a black-box objective. Widely used techniques such as knowledge-based enumeration [9], graph theoretic approaches [10] and mathematical programming [11,12] have limited effectiveness in this situation. Stochastic methods such as random generation, simulated annealing and genetic algorithms (GAs) [13,14] are likely candidates because they function independently of the nature of the objective function. Previous work in computer-aided molecular design has also demonstrated GAs to be flexible in capturing the rich underlying chemistry [15,16]. Moreover, they are robust to non-linearities and hence powerful procedures for global search.

Evolutionary algorithms are based on the Darwinian model of evolution. A random change followed by natural selection is the principle behind successive screening of populations of solution candidates [13]. At every stage new solutions are created from the current population using genetic operators. The genetic operators provide the moves in the structural space of possible designs. A mutation operator replaces one component of a solution with a randomly chosen but appropriate choice. A crossover operator is applied on two solutions at once, where a randomly chosen component of one solution is exchanged with an appropriate component on the other solution. Every solution in the population is evaluated using the forward model to determine its performance quality or fitness. The next generation is selected from the current set in a random but fitness proportionate manner. By maintaining a balance between fostering good building-blocks within the population and introducing random variations from without, the GA provides an environment for optimization to occur. The steps involved in the inverse design phase of the CAMD are as follows.

15.3.2 Inverse Problem

1. Determine the criteria for the performance indicators for additive candidate selection.

2. Identify the choices available under each type of the design-relevant building block.
   a) Additionally determine if the above building blocks themselves could be put together from more fundamental structural units.
   b) Choose (usually) the smaller of the two sets. This will be the base group set for the evolutionary search algorithm.

3. Identify the rules that govern the construction of the final product from the design-relevant building blocks and, if required, the construction of the building blocks themselves from more basic units. These will determine the constraints to be imposed on the genetic operations.

4. Determine suitable genetic operators. These should include at least the following types.
   a) A random candidate generator.
   b) Mutation: A move that replaces a single component of a candidate additive with a suitable choice.
   c) Crossover: A move involving two additive candidates where one or more components of one additive are exchanged for one or more of the other to produce two offspring.

The details of each step with regard to the specifics of the fuel-additive design problem (Figure 3) are taken up in the next section.
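The generic steps of the inverse problem (random generation, mutation, crossover and fitness-proportionate selection) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the building-block labels, the tuple representation and `toy_fitness` are hypothetical placeholders, and the fitness function merely stands in for the forward model.

```python
import random

# Hypothetical building-block labels; a real base group set would come from
# the design-relevant building blocks identified for the problem.
HEADS = ["head_a", "head_b", "head_c"]
LINKERS = ["linker_a", "linker_b"]
TAILS = ["tail_a", "tail_b", "tail_c", "tail_d"]
CHOICES = (HEADS, LINKERS, TAILS)


def random_candidate(rng):
    """Random candidate generator: one choice per building-block type."""
    return tuple(rng.choice(options) for options in CHOICES)


def mutate(candidate, rng):
    """Replace a single component with a randomly chosen block of the same type."""
    pos = rng.randrange(len(candidate))
    new = list(candidate)
    new[pos] = rng.choice(CHOICES[pos])
    return tuple(new)


def crossover(a, b, rng):
    """Exchange one randomly chosen component between two candidates."""
    pos = rng.randrange(len(a))
    child_a, child_b = list(a), list(b)
    child_a[pos], child_b[pos] = b[pos], a[pos]
    return tuple(child_a), tuple(child_b)


def toy_fitness(candidate):
    """Placeholder for the forward model; positive so it can act as a weight."""
    return 1.0 + sum(CHOICES[i].index(part) for i, part in enumerate(candidate))


def evolve(fitness, pop_size=20, generations=30, p_mut=0.2, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness-proportionate (roulette wheel) selection of parents.
        weights = [fitness(c) for c in population]
        parents = rng.choices(population, weights=weights, k=pop_size)
        children = []
        for i in range(0, pop_size - 1, 2):
            children.extend(crossover(parents[i], parents[i + 1], rng))
        if len(children) < pop_size:  # odd population size
            children.append(parents[-1])
        population = [mutate(c, rng) if rng.random() < p_mut else c
                      for c in children]
    return population


final_population = evolve(toy_fitness)
best = max(final_population, key=toy_fitness)
```

Because the loop only ever calls `fitness(candidate)`, the forward model can be replaced without touching the search itself.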

Figure 3: Overall CAMD formulation for additive design to minimize IVD

15.4 SOLUTION STRATEGY

15.4.1 Forward Problem

As described in Section 15.2, the engine testing for intake-valve deposit determination is a long and expensive process. The lack of controlled experiments and a dearth of measured fundamental parameters make a purely first-principles forward model extremely difficult to determine and impossible to verify. On the other hand, noise, consistency and sparsity problems in the data sets make purely statistical or correlative approaches inaccurate and unreliable from a mechanistic viewpoint. However, formulation chemists in the industry often work with the same quality and amount of data and are able to produce marginal to significant improvements in the formulations. They use a hybrid approach that combines the best possible phenomenological description of fuel-additive performance with the intangible but important ability to intuit performance from structure.

The essence of this work, and indeed of any CAMD for engineering materials, is to rationalize this approach. We do this by hybridizing the fundamental and knowledge-based components using a first-principles + statistical/neural-network regression approach. We bind this forward modeling technique to the process of formulation by the identification and use of design-relevant building blocks at the outset. Before any models can be determined, we need a mechanistic description of the chemistry of deposit removal. Figure 4 shows schematically how a fuel-additive molecule works in a fuel+oil milieu. The fuel-additive is sustained in solution by tail-like components in its structure, which we simply refer to as the tail. These polydisperse structures are strongly bound to a "deposit-attracting" core through a chemical block called the linker. The core of the molecule that performs the activity of scavenging the deposit-forming precursors is referred to as the "head". In effect, we have three major "functional" components in the fuel-additive molecule that serve as the design-relevant building blocks. Any changes to the structure of the fuel-additive molecule, be they small substitutions or large structural changes, affect the overall performance of the additive via their effect on the roles of the head, linker and tail. The head, linker and tail components act as performance consolidators of all the structure/chemistry of the additive molecule. As shown in Figure 4, the stability of the additive is the chief performance-determining property of the additive in the fuel. As the additive breaks down in the fuel milieu, the stability of the additive degrades and so does its ability to sustain the deposit-scavenging function. The primary functional descriptor capturing the interacting roles of the head, linker and tail is the time-varying stability of the additive molecule in the gasoline.

The kinetics of additive degradation can be modeled as a series of differential equations on the "effective additive length". As mentioned previously, the tails are polydisperse and hence they have a distribution of lengths. The "effective additive length" is the total length of the additive molecule including the head, linker and the tail. With first-order degradation kinetics (a reasonable assumption), the concentration of effective additive lengths is given by equation (1).

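$$
\begin{aligned}
\frac{dX_{H}}{dt} &= k_H \sum_{i=0}^{N} X_{H+L+i} \\
\frac{dX_{H+L}}{dt} &= -k_H X_{H+L} + k_L X_{H+L+1} + \sum_{i=2}^{N} k_T X_{H+L+i} \\
\frac{dX_{H+L+1}}{dt} &= -\left(k_H + k_L\right) X_{H+L+1} + \sum_{i=2}^{N} k_T X_{H+L+i} \\
\frac{dX_{H+L+j}}{dt} &= -\Big(k_H + k_L + \sum_{i=1}^{j} k_T\Big) X_{H+L+j} + \sum_{i=j+1}^{N} k_T X_{H+L+i}, \qquad 1 < j < N-1 \\
\frac{dX_{H+L+N}}{dt} &= -\Big(k_H + k_L + \sum_{i=1}^{N} k_T\Big) X_{H+L+N}
\end{aligned}
\tag{1}
$$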

Xi is the concentration of additive of length 'i' units. The k's are the different rate constants for bond breakage. The variables in the above equation are explained in Figure 4. Using this model, the distribution of the structures of the additive can be determined as a function of time.
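To make the idea of first-order degradation over a distribution of effective lengths concrete, the sketch below integrates a simplified chain-scission scheme with a fourth-order Runge-Kutta step. It is an illustration only and rests on assumptions that are not taken from this chapter: a single rate constant `k_t` for every tail-tail bond, one linker-tail bond (`k_l`), one head-linker bond (`k_h`), hypothetical numerical values for all three, and no tracking of detached tail fragments. The exact terms of equation (1) may differ.

```python
import math


def degradation_rhs(x, k_h, k_l, k_t):
    """Time derivatives for a simplified first-order scission scheme.

    x[j] is the concentration of head+linker carrying j tail units.
    Returns (d_head, dx), where d_head is the rate of formation of free head.
    """
    n = len(x) - 1
    d_head = k_h * sum(x)                      # any head-linker bond may break
    dx = [0.0] * (n + 1)
    for j in range(n + 1):
        tail_bonds = max(j - 1, 0)             # tail-tail bonds in a j-unit tail
        loss = k_h + (k_l if j >= 1 else 0.0) + tail_bonds * k_t
        dx[j] -= loss * x[j]
        if j >= 1:
            dx[0] += k_l * x[j]                # linker-tail bond breaks
        for m in range(1, j):
            dx[m] += k_t * x[j]                # tail-tail bond breaks, m units remain
    return d_head, dx


def integrate(x0, k_h, k_l, k_t, t_end, steps):
    """Classical RK4 integration; returns (free head, length distribution)."""
    state = [0.0] + list(x0)
    dt = t_end / steps

    def f(s):
        d_head, dx = degradation_rhs(s[1:], k_h, k_l, k_t)
        return [d_head] + dx

    for _ in range(steps):
        k1 = f(state)
        k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)])
        k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)])
        k4 = f([s + dt * k for s, k in zip(state, k3)])
        state = [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
    return state[0], state[1:]


# Hypothetical example: all additive starts with a four-unit tail.
head, lengths = integrate([0.0, 0.0, 0.0, 0.0, 1.0],
                          k_h=0.01, k_l=0.05, k_t=0.02, t_end=10.0, steps=1000)
# In this scheme the longest species decays as a single exponential.
longest_exact = math.exp(-(0.01 + 0.05 + 3 * 0.02) * 10.0)
```

The total amount of head (free or still attached) is conserved in this scheme, which is a convenient sanity check on any implementation of the rate equations.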

Figure 4: Function of additive molecules in the fuel

The objective of the first-principles modeling is not so much to capture in exact quantitative detail the different mechanisms involved, but to acknowledge the physics behind the relevant mechanisms in order to get a rank-ordering of the performance of different additives. Figure 5 shows some generic structures of additives. Additives may contain linkers and heads with more than one site for tails and linkers respectively (Types II & III in Figure 5). Also, more than one tail might appear in succession along a single branch of a linker (Variant 2 in Figure 5). The concentration of "effective lengths" of the additive structures with more than one tail in parallel is obtained as a sum from a joint distribution of several additive structures with single tail strands of various lengths. Once this is done, the time-concentration behavior of the additives with different tail lengths is obtained as a solution to the above set of differential equations. The degradation of the additive is assumed to take place between units, primarily across the weakest links between them. This is similar to pyrolysis, and the activation energies for this breakage are determined in an analogous manner.

Figure 5: Generic topology and connectivity of additive molecules

The stability or solubility of a solute (additive) in a solvent (fuel) is determined by the relative cohesive energy densities of the solute and the solvent. It is one of the most widely used indicators of solubility/miscibility and is characterized by the Hildebrand parameter [17]. This is a measure of the internal-energy density and represents the amount of energy required to move two molecules of a species to infinite separation in solution. The extent of solvation of a substance A in a solvent B depends on how close the cohesive energies of the two substances are. The condition for solubility (and hence stability) becomes

$$
\left| \delta_{Additive}(t) - \delta_{Fuel} \right| \le \lambda_{soluble} \tag{2}
$$

λ_soluble is a pre-set solubility bound, which is usually less than 5 [18]. δ_Additive and δ_Fuel are the Hildebrand parameters of the additive and fuel respectively. As the additive degrades, its structure and therefore its Hildebrand parameter value changes over time in the fuel. For given values of δ_Fuel and λ_soluble, the exact fraction of additives that meet the solubility criterion can then be determined by applying (2). Since the additive needs to remain solvated as long as possible in order to continue removing deposit precursors, the time-varying solubility measure becomes a key indicator of its stability and hence its overall performance in the fuel. We define the amount of additive that is solubilized in the fuel at any point in time as the amount of active additive in the fuel.
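The definition of active additive amounts to summing the fractions of those species in the length distribution whose Hildebrand parameter lies within the solubility bound of the fuel value. A minimal sketch follows; the function name and all numerical values are hypothetical and serve only to illustrate how criterion (2) is applied.

```python
def active_fraction(mole_fractions, deltas, delta_fuel, lam_soluble):
    """Sum the mole fractions of species that satisfy the solubility criterion.

    mole_fractions[i] and deltas[i] describe one effective additive structure
    at a given instant; delta_fuel and lam_soluble use the same units as deltas.
    """
    return sum(
        x
        for x, delta in zip(mole_fractions, deltas)
        if abs(delta - delta_fuel) <= lam_soluble
    )


# Hypothetical distribution at one instant in time.
fractions = [0.5, 0.3, 0.2]
hildebrand = [15.5, 17.5, 19.5]

active_tight = active_fraction(fractions, hildebrand, delta_fuel=15.0, lam_soluble=1.0)
active_medium = active_fraction(fractions, hildebrand, delta_fuel=15.0, lam_soluble=3.0)
active_loose = active_fraction(fractions, hildebrand, delta_fuel=15.0, lam_soluble=5.0)
```

Evaluating this function on the distribution obtained at successive times gives the activity versus time curve that serves as the functional descriptor.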


The Hildebrand parameter can be estimated by group contribution methods, of which the modified Hansen's method [18,19] is the most suitable for this case. Hansen's group contribution method determines three separate contributions to the Hildebrand parameter or the cohesive energy of a molecule. These are due to dispersion, polarity and hydrogen bonding. A molecule is first split up into a set of functional groups that have fixed contributions to each of the above terms. The molar volume of each functional group is also estimated from group contribution (of Hansen [18]). The Hildebrand parameter for a molecule containing Nf functional groups is estimated by equation (3) [18,19]. δd, δp and δh are the dispersion, polar and hydrogen bonding contributions for the entire molecule, and F_di, F_pi and U_hi are the functional group contributions to each of those three terms respectively. V_i is the group contribution to the molar volume of the molecule from functional group 'i'. δ_i are the Hildebrand parameters estimated for each species, δ_T is the total Hildebrand parameter value of all the additive in the fuel, and X_i is the mole fraction of an additive molecule with an effective length of 'i' units. The mole fractions are also distributions (varying with effective length) changing over time due to the degradation reactions. At a given instant of time, one can then determine the additive length distribution curve. Each point on this curve refers to an "effective" additive structure whose components in terms of head, linker and number of tail units can be determined. From the above group contributions the Hildebrand parameter (cohesive energy density) of each additive molecule in the distribution at a given instant in time can be calculated. Using the criterion of equation (2), for a given λ_soluble the amount of active additive in the fuel can then be determined as a function of time. This is the first-principles component of the forward model.

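$$
\delta_d = \frac{\sum_{i=1}^{N_f} F_{di}}{V}, \qquad
\delta_p = \frac{\sqrt{\sum_{i=1}^{N_f} F_{pi}^{2}}}{V}, \qquad
\delta_h = \sqrt{\frac{\sum_{i=1}^{N_f} U_{hi}}{V}}, \qquad
\delta = \sqrt{\delta_d^{2} + \delta_p^{2} + \delta_h^{2}}, \qquad
V = \sum_{i=1}^{N_f} V_i
\tag{3}
$$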

$$
\delta_T = \sum_{i=1}^{N} X_i(t)\,\delta_i \tag{4}
$$
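The group-contribution estimate and the mole-fraction average of equation (4) can be sketched as follows. The sketch assumes the commonly used three-component form (dispersion sum divided by volume, root of summed squared polar terms divided by volume, root of hydrogen-bonding energy over volume); the modified method used in this chapter may differ in detail. The group table, its keys and all numerical values are hypothetical placeholders chosen for easy checking, not published group contributions.

```python
import math

# Hypothetical group-contribution table: Fd, Fp, Uh and molar volume V per group.
GROUP_TABLE = {
    "group_x": {"Fd": 30.0, "Fp": 40.0, "Uh": 1440.0, "V": 10.0},
    "group_y": {"Fd": 300.0, "Fp": 0.0, "Uh": 0.0, "V": 20.0},
}


def hildebrand_parameter(groups, table):
    """Estimate the total solubility parameter from group counts.

    groups maps a group name to the number of times it occurs in the molecule.
    """
    fd = sum(n * table[g]["Fd"] for g, n in groups.items())
    fp_sq = sum(n * table[g]["Fp"] ** 2 for g, n in groups.items())
    uh = sum(n * table[g]["Uh"] for g, n in groups.items())
    volume = sum(n * table[g]["V"] for g, n in groups.items())
    delta_d = fd / volume
    delta_p = math.sqrt(fp_sq) / volume
    delta_h = math.sqrt(uh / volume)
    return math.sqrt(delta_d ** 2 + delta_p ** 2 + delta_h ** 2)


def mixture_parameter(mole_fractions, deltas):
    """Mole-fraction weighted sum over the additive length distribution."""
    return sum(x * d for x, d in zip(mole_fractions, deltas))


delta_x = hildebrand_parameter({"group_x": 1}, GROUP_TABLE)
delta_y = hildebrand_parameter({"group_y": 2}, GROUP_TABLE)
delta_total = mixture_parameter([0.25, 0.75], [delta_x, delta_y])
```

With the placeholder values, `group_x` alone gives components 3, 4 and 12, so the total is 13; two `group_y` units give 600/40 = 15.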

The first-principles model predicts the fuel activity as a function of time. What remains is to correlate this activity vs. time curve to the eventual performance indicator, which in this case is the amount of intake-valve deposit. The forward branch of Figure 1 illustrates how one might accomplish this. The functional descriptors obtained from the first-principles model have to be correlated against the IVD data from the databases. Both linear and non-linear regression models should be explored. A point to note is that the λ_soluble parameter can be varied within its normal bounds to obtain different curves (for every datapoint in the database), and the data picked from the corresponding curves for all the datapoints can then be regressed against the intake-valve measurements. The λ_soluble value corresponding to the best regression model is then chosen as the optimal bound. For this case study, neural networks [22] are the nonlinear method of choice. They are relatively function free and easy to implement. Moreover, we do not have large fundamental constraints on the regression models, and this makes the situation further attractive for the use of neural networks. Linear models as well as different architectures of neural networks are implemented and the optimal one is determined based on accuracy in prediction. The results are discussed in Section 15.5.1.
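The selection of the solubility bound by regression quality can be sketched as a simple sweep. The example below is an assumption-laden illustration: it uses a one-variable least-squares line as a stand-in for the regression model (the chapter also explores neural networks), a single active-fraction value per datapoint as the descriptor, and a small synthetic database whose numbers are invented for the purpose of the demonstration.

```python
def active_fraction(distribution, delta_fuel, lam):
    """distribution is a list of (mole fraction, Hildebrand parameter) pairs."""
    return sum(x for x, delta in distribution if abs(delta - delta_fuel) <= lam)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns the SSE too."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        slope = 0.0                              # descriptor carries no information
    else:
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_solubility_bound(database, delta_fuel, candidates):
    """Return the candidate bound whose descriptor gives the lowest SSE."""
    scored = []
    for lam in candidates:
        xs = [active_fraction(dist, delta_fuel, lam) for dist, _ in database]
        ys = [ivd for _, ivd in database]
        scored.append((fit_line(xs, ys)[2], lam))
    return min(scored)[1]


# Synthetic database: (length distribution, measured IVD); values are invented.
SYNTHETIC_DB = [
    ([(0.5, 15.5), (0.3, 17.5), (0.2, 19.5)], 80.0),
    ([(0.2, 16.5), (0.5, 17.8), (0.3, 18.6)], 95.0),
    ([(0.6, 15.2), (0.1, 18.2), (0.3, 21.0)], 110.0),
    ([(0.1, 14.0), (0.4, 12.5), (0.5, 11.5)], 125.0),
]

best_lambda = best_solubility_bound(SYNTHETIC_DB, delta_fuel=15.0,
                                    candidates=[1.0, 2.0, 3.0, 4.0])
```

The synthetic deposits were generated to be exactly linear in the active fraction at a bound of 3, so the sweep recovers that value; with real data the comparison would normally be made on held-out predictions rather than on the fitting error.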

15.4.2 Inverse Problem

The design problem involves the construction of optimal fuel-additive molecules given desired IVD requirements. For reasons outlined earlier, an evolutionary search is employed to achieve this. Unlike deterministic approaches such as mathematical programming, which contain a formulation phase and a solution phase, evolutionary approaches usually contain only a solution phase. While the details specific to the problem are explicitly modeled in the formulation phase of a mathematical programming approach, they have to be dealt with in the solution phase of an evolutionary method. This implies that an evolutionary algorithm built for one domain differs in most aspects from one applied in another domain. However, the major components of evolutionary search procedures share a common spirit even across different application domains. These aspects were outlined previously. For the fuel-additive design problem, the components of an evolutionary design procedure are customized as follows.

Representation: With the identification of the design-relevant building blocks this step is straightforward. We choose to represent an additive molecule to contain a head, linkers and tails. But there are constraints on how these components are put together based on their chemical make up as well as rules of synthesizability. To accommodate these rules, the head, linker and tail representation is recast into an object-oriented representation as shown in Figure 6. Under each object category information about generic object properties such as compatibility lists, connectivity and group contributions are retained. In addition, when these objects are connected to form additive molecules, specific adjacency information is also retained in the object structure.

Feasibility Rules: Chemical and synthesizability rules for the design problem fall into two categories: (i) disallowed combinations of head-linker and/or linker-tail pairs, and (ii) feasible construction based on existing connectivity. The second category is transparently imposed via the object-oriented structure. Since we keep track of the actual connectivity (valence) of the head, linker and tail as well as the connectedness information in the molecule, feasibility can be enforced during generation, mutation and recombination (crossover). The first rule is complicated by the fact that a head may contain more than one type of site, not all of which are compatible with a given linker. Imposing the first rule during generation is straightforward. The genetic operators, however, are modified from their generic counterparts to seek feasibility as well as to recover feasibility after operation.
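One possible way to picture the object-oriented representation and the two categories of feasibility rules is sketched below. This is not the data structure of Figure 6; the class names, fields and the tiny component set are hypothetical, and group contributions are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Tail:
    name: str


@dataclass(frozen=True)
class Linker:
    name: str
    n_tail_sites: int                  # connectivity available for tails
    compatible_tails: FrozenSet[str]   # compatibility list (tail names)


@dataclass(frozen=True)
class Head:
    name: str
    # One entry per attachment site: the linker names that site accepts.
    site_compat: Tuple[FrozenSet[str], ...]


@dataclass
class Branch:
    linker: Linker
    tails: List[Tail]


@dataclass
class Additive:
    head: Head
    # Adjacency information: head site index -> branch attached there.
    branches: Dict[int, Branch] = field(default_factory=dict)


def is_feasible(mol: Additive) -> bool:
    """Check rule (i) (disallowed pairs) and rule (ii) (connectivity)."""
    for site, branch in mol.branches.items():
        if not 0 <= site < len(mol.head.site_compat):
            return False                       # no such site on the head
        if branch.linker.name not in mol.head.site_compat[site]:
            return False                       # disallowed head-linker pair
        if len(branch.tails) > branch.linker.n_tail_sites:
            return False                       # exceeds linker connectivity
        if any(t.name not in branch.linker.compatible_tails
               for t in branch.tails):
            return False                       # disallowed linker-tail pair
    return True


# Hypothetical components: a two-site head, two linkers, two tails.
t1, t2 = Tail("T1"), Tail("T2")
l1 = Linker("L1", 2, frozenset({"T1"}))
l2 = Linker("L2", 1, frozenset({"T1", "T2"}))
head = Head("H1", (frozenset({"L1", "L2"}), frozenset({"L1"})))

ok = Additive(head, {0: Branch(l2, [t2]), 1: Branch(l1, [t1, t1])})
bad = Additive(head, {1: Branch(l2, [t1])})   # L2 is not allowed on site 1
print(is_feasible(ok), is_feasible(bad))
```

Keeping the compatibility lists on the components themselves means the same check can be reused after generation, mutation and crossover.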

Figure 6: Object-oriented representation for head, linker and tail in evolutionary search

Genetic Operators: Mutation and crossover are the genetic operators of choice for this problem. Unlike the operators widely used for evolutionary methods in the literature, we can customize them to better reflect the chemistry of the additive design problem. The first step is to bind these operators to the component being operated upon. For each of the four components of head, linker, tail and branch (a linker+tail path when several such structures are connected in parallel to the head), one mutation and one crossover operator are created. Within each of these operators, we also need to reflect the feasibility rules to ensure legitimate product candidates. The outline of feasibility enforcement is similar across operators, and only two will be discussed here.

Branch Crossover: The crossover operation involving one branch of a fuel-additive molecule with a branch from another molecule is shown in Figure 7. The constraint to be considered is one of head-linker compatibility. The linker attached to one branch should be compatible with the linker site on the head component of the other molecule. This is enforced as shown in Figure 7. During the pairing phase, after one parent has been chosen, the second parent is chosen so that it contains at least one branch that is compatible with a site on the head of the first parent. As shown in the example in Figure 7, Branch-1 is compatible with both sites on the head of its parent (Parent-1), as well as the single site on Parent-2. On the other hand, the single branch of Parent-2 can go only on Site-1 of Parent-1. A simple crossover of Branch-1 of Parent-1 with the single branch of Parent-2 results in infeasibility. This is averted by switching Branch-1 of Parent-1 onto Site-2 and moving the crossed-over branch from Parent-2 onto the vacated site. Similar operations can be (and are) defined for the other crossover operators.

Figure 7: Branch crossover operator
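The feasibility-recovering step of the branch crossover can be sketched as a small assignment problem: after the swap, the branches of each child are re-placed on the head so that every linker sits on a compatible site. The sketch below is loosely modeled on the situation described for Figure 7 and is not the authors' implementation. Branches are reduced to their linker type (tails are assumed to travel with the branch), the brute-force search over permutations is an assumption that is only practical for heads with a handful of sites, and all names are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Optional, Set, Tuple

Sites = List[Set[str]]      # allowed linker types for each head site
Branches = Dict[int, str]   # head site index -> linker type of the branch


def assign(sites: Sites, linkers: List[str]) -> Optional[Branches]:
    """Place each linker on a distinct compatible site, or return None."""
    for perm in permutations(range(len(sites)), len(linkers)):
        if all(linker in sites[s] for s, linker in zip(perm, linkers)):
            return dict(zip(perm, linkers))
    return None


def branch_crossover(sites_a: Sites, br_a: Branches, site_a: int,
                     sites_b: Sites, br_b: Branches, site_b: int
                     ) -> Optional[Tuple[Branches, Branches]]:
    """Swap the branch on site_a of parent A with the one on site_b of B.

    Branches that stay on a parent may be moved to another site if that is
    needed to accommodate the incoming branch. Returns None when no
    feasible arrangement exists for either child.
    """
    new_a = [l for s, l in br_a.items() if s != site_a] + [br_b[site_b]]
    new_b = [l for s, l in br_b.items() if s != site_b] + [br_a[site_a]]
    child_a, child_b = assign(sites_a, new_a), assign(sites_b, new_b)
    if child_a is None or child_b is None:
        return None
    return child_a, child_b


# Parent A: site 0 accepts L1 and L2, site 1 accepts only L1.
sites_a = [{"L1", "L2"}, {"L1"}]
br_a = {0: "L1", 1: "L1"}
# Parent B: a single site carrying an L2 branch.
sites_b = [{"L1", "L2"}]
br_b = {0: "L2"}

# Naively dropping B's L2 branch onto site 1 of A would be infeasible;
# the remaining L1 branch is moved to site 1 so that L2 can take site 0.
children = branch_crossover(sites_a, br_a, 1, sites_b, br_b, 0)
print(children)
```

For larger heads a bipartite matching routine would be a more economical replacement for the permutation search.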

Linker Mutation: This operation replaces a linker on a chosen parent with a compatible linker from the base set. This is done in two steps. First, a linker is chosen at random from the set of linkers compatible with the head-linker attachment site. This linker is used as the replacement if a sufficient number (as many as the branching of the new linker) of tails on the removed linker are compatible with the sites on the new one. If not, different rearrangements of the tails are explored to identify a compatible configuration. If no such arrangement can be found, this linker is dropped from the list of compatibles (only for this instance of the selection procedure) and a different linker component is chosen. If the chosen linker has more branch points than the linker it is replacing, the additional branches/tails are chosen at random from those compatible with the vacant attachment sites on the linker. If the chosen linker has fewer branch points, the surplus tails after feasible assignment are discarded.

The crossover rate is set at 0.60 and the mutation rate at 0.40. The rate for each individual type of crossover and mutation is the same within the class. The large mutation rate is warranted by the non-binary representation.

Fitness Function: The hybrid forward model can be decoupled to design either for maximum additive stability (the output of the first-principles model) or for low intake-valve deposit performance (the final output of the hybrid model). The fitness function returns a value characterizing the quality of the solution, usually a number between zero and one, with larger values being more desirable. For the solubility objective, the fitness function is defined directly as the fraction of the initial amount of additive that is active (at a pre-specified time t), based on the definition of activity given earlier. For the IVD-based objective, the fitness function is defined as follows:

$$F = \begin{cases} e^{-\alpha\,(IVD_{pred} - IVD_{ref})}, & IVD_{pred} > IVD_{ref} \\ 1, & IVD_{pred} \le IVD_{ref} \end{cases} \qquad (5)$$
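A direct transcription of equation (5) might look like the sketch below. The default reference value of 10 mg is an assumption taken from the design target quoted later in the chapter, the default α is the value used in this case study, and the function and argument names are hypothetical.

```python
import math


def ivd_fitness(ivd_pred: float, ivd_ref: float = 10.0,
                alpha: float = 0.002) -> float:
    """Equation (5): 1 at or below the reference IVD, exponential decay above.

    ASSUMPTION: ivd_ref = 10 mg, the design target quoted in the text.
    """
    if ivd_pred <= ivd_ref:
        return 1.0
    return math.exp(-alpha * (ivd_pred - ivd_ref))


print(ivd_fitness(8.9), round(ivd_fitness(11.4), 3))
```

Because the exponent is small for deposits within a few milligrams of the reference, candidates close to the target receive fitness values very near one, which keeps the selection pressure among good designs gentle.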

An α value of 0.002 is used for this case study. The various steps involved in the evolutionary design algorithm are shown in Figure 8. The algorithm is initialized with a single randomly generated (but feasible) lead structure. Copying the single lead structure as many times as the population size gives the very first generation. The fitness of each member of the population is evaluated and the population is sorted accordingly. The top few solutions in each generation are retained in the next generation. Fitness-proportionate selection is employed to select parents, which then undergo mutation and crossover to form the rest of the population. The cycle is continued until some pre-specified termination criteria are met.
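The cycle described above can be outlined in a few lines. The sketch is generic and makes several assumptions that the text does not fix: two elite members are carried over per generation, each offspring is produced either by crossover (probability 0.60) or by mutation (probability 0.40), termination is a fixed number of generations, and the candidate is a toy integer rather than an additive molecule. All function names are hypothetical.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def evolve(lead: T,
           fitness: Callable[[T], float],
           mutate: Callable[[T, random.Random], T],
           crossover: Callable[[T, T, random.Random], T],
           pop_size: int = 25, generations: int = 25,
           n_elite: int = 2, p_cross: float = 0.60,
           seed: int = 0) -> List[T]:
    """Elitist evolutionary loop started from copies of one lead structure."""
    rng = random.Random(seed)
    # First generation: the lead copied pop_size times (use deep copies
    # if the candidate objects are mutable).
    population = [lead] * pop_size
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)      # rank by fitness
        # Tiny offset keeps the selection weights strictly positive.
        weights = [fitness(m) + 1e-12 for m in population]
        next_gen = population[:n_elite]                 # retain the top few
        while len(next_gen) < pop_size:
            # Fitness-proportionate (roulette-wheel) parent selection.
            p1, p2 = rng.choices(population, weights=weights, k=2)
            if rng.random() < p_cross:
                next_gen.append(crossover(p1, p2, rng))
            else:
                next_gen.append(mutate(p1, rng))
        population = next_gen
    population.sort(key=fitness, reverse=True)
    return population


# Toy problem: integers, with the optimum at 7.
def toy_fitness(x: int) -> float:
    return 1.0 / (1.0 + abs(x - 7))


def toy_mutate(x: int, rng: random.Random) -> int:
    return x + rng.choice((-1, 1))


def toy_crossover(a: int, b: int, rng: random.Random) -> int:
    return (a + b) // 2


final = evolve(0, toy_fitness, toy_mutate, toy_crossover)
print(final[0], toy_fitness(final[0]))
```

Since the first generation consists of identical copies, crossover cannot create diversity at the start; it is mutation that spreads the population away from the lead, which is consistent with the comparatively large mutation rate used in the case study.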

15.5 RESULTS AND DISCUSSION

The database for the case study consists of engine test results (provided by the Lubrizol Corporation) based on three different engines, which are referred to as the BMW, Honda and Ford databases. The intake-valve deposit measurement is the performance indicator of interest here. For each test, the databases also provide the structure of the additive package used and some characteristics of the gasoline. The solubility distribution of additives of different effective length as a function of time is determined from the first-principles model. Using the fuel character, a reasonable fuel cohesive energy density is determined and the active additive concentration profile is calculated. Both linear and neural-network models are examined. As mentioned earlier, the stability bound ($\lambda_{soluble}$) is adjusted to get the best correlative curve. The amount of active additive remaining in the fuel at different times is used as the input to the regression model.

Figure 8: Evolutionary Design of Fuel Additives

15.5.1 Forward Model

Figure 9 (reproduced from Sundaram et al. [20]) shows the cohesive energy density profile (characterized by the Hildebrand parameter) of different additives from the BMW database. It is clear that the fundamental model is quite successful in capturing significant stability differences between the additive packages. Fraction 2 is a Type III structure and Fraction 1 is a Type II structure. The larger non-polar (more soluble) contribution from the additional tails leads to a smaller initial value of the Hildebrand parameter for Fraction 2. But as the tails start degrading over time, the highly polar additional linker in Fraction 2 contributes significantly to the final Hildebrand value, making it less soluble. Two additives that perform similarly in a given fuel over a particular time scale may indeed have different stability if either the fuel or the time scale of interest is different. The differences between Package 2 and Package 3 are primarily due to the different polar contributions of their core fragments. As these results demonstrate, the first-principles model is able to distinguish between the chemical nature of the sub-structures in different additives and to pick out the stability contributions from the topology or connectivity of the additive molecules.

Figure 9: Solubility profiles for different packages in the BMW database (Reproduced from [20])

As a second level of validation of the model, we try to ascertain the importance of the stability argument in the eventual intake-valve deposit performance of the additive. This is an underpinning assumption for the regression model. To this end, the additive packages from the Ford database (the name refers to the type of engine test used) are assigned relative quality measures. These relative quality indicators were assigned by an expert on the basis of previous experience with these additive packages. The solubility predictor was compared against the expert-assigned measure. The resulting correlation had an R² of 0.965. This demonstrates quite clearly that stability is a significant discriminator between additives.


Table 1: Comparison of performance of different regression models for IVD prediction from stability descriptors (from Sundaram et al. [20]).

Database   Model                                  Variables   Projections   RMSE (mg) in cross-validation
BMW        QSAR (Summarized)                                  None          214
BMW        QSAR (PCA)                             36          None          172
BMW        QSAR (PLS)                             6           None          142
BMW        Solubility (NN: 4 Radial Basis)        1           None          124
Honda      Solubility (Linear)                    30          PLS (3)       33
Honda      Solubility (NN: 2 Tan-Sigmoid)         30          None          35
Honda      Solubility (PLS-NN: 2 Tan-Sigmoid)     30          PLS (4)       30.7
Honda      Solubility (PLS-NN: 3 Tan-Sigmoid)     30          PLS (3)       31.2
Honda      Solubility (PLS-NN: 4 Tan-Sigmoid)     30          PLS (3)       31.4

The regression models are developed to correlate the activity profile of the additive to the intake-valve deposit measurement. Both linear and neural network models can be constructed for this purpose. The models are developed, i.e. their parameters are determined, during the training phase, where only a part of the data set is used. In the testing phase, the model is presented with the rest of the data, which was not used in training. The predictions in the testing phase are compared against the actual measured values, and the error is taken to be representative of the quality of the regression model. In bootstrapped cross-validation mode, several different partitions of the data are made (into training and testing subsets) and the models are trained and tested on each partition. The overall average error during testing across all partitions is then reported as the quality of the model, and the best of several competing models is chosen [21].
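The partition-and-average procedure can be illustrated with a minimal sketch. It uses repeated random train/test splits as a simple stand-in for the bootstrapped scheme of [21], compares two deliberately simple competing models (a constant and a straight line) instead of the linear and neural-network models of the case study, and runs on synthetic data; the test fraction, the data and all names are assumptions.

```python
import math
import random
from statistics import mean
from typing import Callable, List, Sequence

Model = Callable[[float], float]
Fitter = Callable[[Sequence[float], Sequence[float]], Model]


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Ordinary least-squares straight line in one variable."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
             if sxx else 0.0)
    return lambda x: my + slope * (x - mx)


def cv_rmse(fit: Fitter, xs: Sequence[float], ys: Sequence[float],
            n_partitions: int = 10, test_frac: float = 0.3,
            seed: int = 0) -> float:
    """Average test RMSE over several random train/test partitions."""
    rng = random.Random(seed)       # same seed -> same partitions per model
    idx = list(range(len(xs)))
    n_test = max(1, int(test_frac * len(xs)))
    errors: List[float] = []
    for _ in range(n_partitions):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(math.sqrt(mean((model(xs[i]) - ys[i]) ** 2
                                     for i in test)))
    return mean(errors)


# Synthetic data: deposit (mg) falling with active additive fraction.
noise = random.Random(1)
xs = [i / 19 for i in range(20)]
ys = [200.0 - 150.0 * x + noise.uniform(-5.0, 5.0) for x in xs]

scores = {"mean": cv_rmse(fit_mean, xs, ys), "line": cv_rmse(fit_line, xs, ys)}
best = min(scores, key=scores.get)
print(scores, best)
```

Using the same partitions for every competing model, as done here through the shared seed, keeps the comparison between models from being dominated by the luck of the split.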

Neural networks are function-free nonlinear models, but their architectures, in terms of the number of neurons, layers and transfer functions, can be varied. By varying these parameters, different architectures of the standard feed-forward network [22] can be examined. In addition, the so-called PLS-NN architectures are also examined. Briefly, these are models where the neural networks are fit to the residuals of successive linear models extracted by applying the partial least squares (PLS) technique [23]. The PLS approach involves successive projections of the input variables into linear combinations based on maximum correlation with the output. For further details the reader is referred to the different papers in this area [23,24,25].

The results for the forward modeling effort are summarized in Table 1. The first column refers to the database. The second column indicates the model type (such as linear, NN, PCA, PLS, etc.). The QSAR models are quantitative structure-activity relationships, and the ones listed in the second column are those in use at Lubrizol. The solubility models are the ones based on the functional stability description of the molecules. The third column shows the number of variables in the model (before projections). The fourth column indicates the projection type and the number of projections eventually used in the modeling. The last column is the RMSE (in mg) based on cross-validation over 10 different partitions of the data sets. For the BMW database, the quality of the data is not very good. The data are engine test results collected over ten years, and the experimental errors were quite large [26]. Adding to this is the small size of the dataset (92 points). Even with these limitations, the hybrid model clearly outperforms the best QSAR model, reducing the cross-validation RMSE from 142 mg to 124 mg. For the Honda database the models perform up to the quality of the data. Linear models for the Honda database perform quite closely to the neural networks. However, they were more sensitive to data partitioning, so the neural network models are preferable in this regard. The PLS-NN models perform a little better than the standard feed-forward architectures and are the models of choice for this database.

The hybrid first-principles + regression approach to forward modeling is quite accurate given the sparseness of the data and the large experimental errors. Additionally, this approach provides an intermediate indicator of stability (the amount of active additive), which by itself can be used as a relatively easy-to-measure performance and modeling standard.

15.5.2 Evolutionary Design of Fuel-Additives

Design for maximum solubility

Table 2 shows the results of the evolutionary search with solubility as the objective. The objective is to find the additive molecule(s) that are most soluble in a given solvent (characterized by the solvent Hildebrand parameter shown in column one) at a given instant of time. The design procedure was allowed to run for 25 generations with a population of 25 molecules. The base set of design-relevant groups consisted of 25 heads, 9 linkers and 3 tails. The time instant of interest was then varied, keeping the fuel Hildebrand parameter fixed, to determine a different set of designs for maximum solubility.

Table 2 shows the solubilized fractions of the additive for four different solvents (fuels) and three different time instants ($\tau$ = 1, 5 and 10). The solubility vs. time curves for the additive molecules identified to be highly soluble at $\tau$ = 1 are then used to determine their solubility at $\tau$ = 5 and 10. This is reported in columns 3(A) and 4(A) of Table 2. The solubility values are then compared with those of the additives designed for maximum solubility at times $\tau$ = 5 and 10. These values are reported in columns 3(B) and 4(B) of Table 2. It is clear from the table that the additive design is sensitive to the nature of the fuel and the time on stream. Additives that perform well at short contact times need not do so at longer times, and the difference is larger as the fuel becomes increasingly polar (larger $\delta$). The results demonstrate that the evolutionary algorithm is successfully exploiting the differences in short- and long-term behavior between different additives through the use of the design-relevant building blocks.

Table 2. Evolutionary design of additives for stability: Results

δ (MPa^1/2)   τ = 1   τ = 5   τ = 5   τ = 10   τ = 10
[Fuel]        2(A)    3(A)    3(B)    4(A)     4(B)
19            0.96    0.77    0.79    0.55     0.64
21            0.96    0.82    0.85    0.7      0.73
23            0.92    0.67    0.72    0.45     0.61
25            0.42    0.31    0.41    0.08     0.5

2(A): best design at τ = 1. 3(A): solubility at τ = 5 of the best design at τ = 1. 3(B): best design at τ = 5. 4(A): solubility at τ = 10 of the best design at τ = 1. 4(B): best design at τ = 10.
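The comparison between the (A) and (B) columns of Table 2 can be made explicit by tabulating, for each fuel, how much solubility is gained by designing specifically for the later time instant. The short sketch below does this with the values from the table; the variable names are hypothetical.

```python
# Fuel Hildebrand parameter -> (2(A), 3(A), 3(B), 4(A), 4(B)) from Table 2.
table2 = {
    19: (0.96, 0.77, 0.79, 0.55, 0.64),
    21: (0.96, 0.82, 0.85, 0.70, 0.73),
    23: (0.92, 0.67, 0.72, 0.45, 0.61),
    25: (0.42, 0.31, 0.41, 0.08, 0.50),
}

# Gain from designing for the later time instead of reusing the tau = 1
# design: (3(B) - 3(A), 4(B) - 4(A)), rounded to two decimals.
gain = {d: (round(r[2] - r[1], 2), round(r[4] - r[3], 2))
        for d, r in table2.items()}

for d, (g5, g10) in gain.items():
    print(f"delta = {d}: gain at tau=5 = {g5:.2f}, at tau=10 = {g10:.2f}")
```

The gain is largest for the most polar fuel (δ = 25) at both time instants, although the increase at the longest time is not strictly monotonic across the four fuels.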

Design for minimum IVD

The forward model for this run consisted of the complete hybrid model that predicts the expected intake-valve deposit given the structure of the additive, the fuel structure or character, and the operating conditions. The regression model used is the PLS-NN model trained on the Honda database. The objective now is to find designs that are predicted to produce an IVD close to or less than 10 mg. The fitness function of equation (5) is used here. Again the evolutionary algorithm is allowed to run for 25 generations with a population size of 25. The rank, fitness, IVD and structural details of some of the top ten additive molecules are reported in Table 3 for three different runs. For proprietary reasons, the structures are not revealed. However, each structure is described in terms of the generic structural/connectivity definitions used in Figure 5. Some structures contain commonly used components based on the additive molecules in the databases. However, the combinations (of the head, linkers and tails) are novel, leading to different predicted performance. The best design, giving a predicted IVD of 8.9 mg (structure 3-A), contains commonly used components and in fact is identified to possess good synthesis potential.

Table 3. Optimal solutions from the evolutionary design of fuel-additives for intake-valve deposit performance

Run Identifier   Fitness   Rank   IVD Predicted (mg)   Type                  Comments
1-A              0.997     1      11.4                 Type III              Novel structure; components not usually used in database
1-B              0.996     2      11.5                 Type III              Novel structure
1-C              0.993     6      12                   Type II               Variant of structure in BMW database
2-A              0.999     1      10.1                 Type II               Novel structure; different from 1-A; contains a rarely used linker type
2-B              0.989     2      12.6                 Type II               Slight variation of a commonly found structure in the databases
2-C              0.983     4      13.2                 Type II; Variant 2    A two-tails-in-series variation of 2-B
3-A              1         1      8.9                  Type III              Novel structure; distinct from both 1-A and 2-A; high synthesis potential
3-B              0.994     2      11.9                 Type III; Variant 2   Variation of 3-A
3-C              0.993     3      12.1                 Type III              Different core compared to 2-B; an additional branch



Design of engineering materials involves design for performance instead of inherent properties. A CAMD procedure for engineering materials should capture the phenomenological underpinnings of how molecular structure interacts with the environment leading to performance. The linchpin in this approach is the identification of structural aggregates that consolidate the influential effects (on performance) of smaller units. They act as sensitive design-relevant building blocks and intimately tie the forward and inverse problems together. We have demonstrated through this case study, how design-relevant building blocks can be identified for a real industrial problem in fuel-additive design. A hybrid model based on functional descriptors derived from these building blocks was implemented. The model captures the chemistry and physics of how additives sustain their deposit removal ability while maintaining robustness to noise and sparseness in the data. An evolutionary search procedure was implemented to determine optimal additive molecules that meet desired performance criteria. The design algorithm was customized to handle inherent constraints and hence avoids some of the feasibility pitfalls of stochastic algorithms. The inverse design algorithm was shown to locate optimal solutions that also possessed a high potential for synthesis.

In this case study, we concentrated on the stability of the additive as the function influencing its deposit removal capability. While this is the predominant effect, other factors, such as the susceptibility of the core to nucleophilic attack and the tendency of the additive to degrade rapidly in the combustion chamber (just enough stability), are also important to lesser extents. These effects could be modeled separately through relevant mechanistic descriptions. The point to note here is that the design-relevant building blocks are the same for these models also, but the contributing sub-units of the building blocks and their functional interactions will be different. The building blocks allow for different functional descriptions while retaining the same design-level structural abstraction.

The hybrid model for IVD prediction was a black-box model due to the use of a neural network. Even the phenomenological component of the hybrid model turned out to be nonlinear in this case. However in other design domains where the forward models are linear or transformably so, suitable deterministic techniques should be used. Even in this case study other stochastic methods such as simulated annealing [27] that have been successfully applied to other CAMD problems [28] could be used to tackle the inverse problem. The essential idea and indeed the power behind this CAMD approach is the identification of the most sensitive set of design/decision variables based on a first-principles understanding of how structure relates to performance. Then the best search procedure can be customized to navigate the designs cast in that decision space.


Acknowledgements

We would like to thank the Lubrizol Corporation, Wickliffe, OH, for their support of this work as well as for the data. Thanks are also due to Dr. Dan Daly at Lubrizol for numerous discussions and help in understanding fuel-additives.

15.7 REFERENCES

[1] Chan, K., Computer-Aided Molecular Design Using Genetic Algorithms, Ph.D. Thesis, Purdue University, 1994.
[2] Kalghatgi, G.T., Deposits in Gasoline Engines - A Literature Review. SAE Technical Paper Series: Lubricants and Fuels, 1990. 99:4(902015): p. 639-667.
[3] Lacey, P.I., Kohl, K.H., Stavinoha, L.L., and Estefan, R.M., A Laboratory-Scale Test to Predict Intake Valve Deposits. SAE Technical Paper Series: Lubricants and Fuels, 1997. 106:4(972833): p. 880-891.
[4] Grant, L.J. and R.L. Mason, SwRI-BMW N.A. Intake Valve Deposit Test - A Statistical Review. SAE Technical Paper Series: Lubricants and Fuels, 1992. 101:4(922215): p. 1221-1230.
[5] Graham, J.P. and B. Evans, Effect of Intake Valve Deposits on Driveability. SAE Technical Paper Series: Lubricants and Fuels, 1992. 101:4(922259): p. 1231-1245.
[6] Houser, K.R. and T.A. Crosby, The Impact of Intake Valve Deposits on Exhaust Emissions. SAE Technical Paper Series: Lubricants and Fuels, 1995. 103:4(922259): p. 1432-1451.
[7] Office of Mobile Resources, Final Rule: Certification Standards for Deposit Control Gasoline Additives, 1996, US Environmental Protection Agency.
[8] ASTM D5500-94. Standard test method for vehicles evaluation of unleaded automotive spark-ignition engine fuel for intake valve deposit formation, Section 5, Vol 5.03, American Society for Testing and Materials, Jan 1995.
[9] Gani, R. and Fredenslund, Aa., Computer Aided Molecular and Mixture Design With Specified Property Constraints, AIChE J., 1991, 37(9): p. 1318.
[10] Skvortsova, M.I., et al., Inverse Problem in QSAR/QSPR studies for the case of topological indices characterizing molecular shape (Kier Indices), J. Chem. Inf. Comput. Sci., 1993, 33: p. 630-634.
[11] Churi, N. and Achenie, L.E.K., Novel Mathematical Programming Model for Computer Aided Molecular Design, Ind. Eng. Chem. Research, 1996, 35: p. 3788-3794.
[12] Maranas, C.D., Optimal Computer-Aided Molecular Design: A Polymer Design Case Study, Ind. Eng. Chem. Res., 1996, 35: p. 3403-3414.
[13] Holland, J.H., Adaptation in Natural and Artificial Systems. 1975, Ann Arbor: University of Michigan Press.
[14] Goldberg, D.E., Genetic Algorithms in Search, Optimization and Machine Learning. 1989, Reading, Mass.: Addison-Wesley.
[15] Venkatasubramanian, V., Chan, K. and Caruthers, J.M., Genetic Algorithmic Approach for Computer-Aided Molecular Design, in Computer-Aided Molecular Design. 1995. p. 396-414.
[16] Venkatasubramanian, V., et al., Computer-aided Molecular Design using Neural-Networks and Genetic Algorithms, in Genetic Algorithms in Molecular Modeling, J. Devillers, Editor. 1996: London. p. 271-302.
[17] Hildebrand, J.H. and R.L. Scott, Regular Solutions. 1962.
[18] Barton, A.M., CRC Handbook of solubility parameters and other cohesion parameters. 1991: CRC Press.
[19] Meusberger, K.E., Pesticide Formulations. Am. Chem. Soc. Symp. Ser., 1988. 371: p. 151.
[20] Sundaram, A., et al., Design of Fuel Additives Using Neural Networks and Evolutionary Algorithms, AIChE Journal, 2001, 47(6): p. 1387-1405.
[21] Schenker, B. and Agarwal, M., Cross-Validated Structure Selection for Neural-Networks. Computers Chem. Eng., 1996. 20(2): p. 175-186.
[22] Haykin, S., Neural Networks: A Comprehensive Foundation. 1999.
[23] Andersson, G., Kaufmann, P. and Renberg, L., Nonlinear Modeling with a Coupled Neural Network - PLS Regression System. Journal of Chemometrics, 1996. 10: p. 605-614.
[24] Geladi, P. and Kowalski, B.R., Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta, 1986. 185: p. 1-17.
[25] Qin, S.J. and McAvoy, T.J., Nonlinear PLS Modeling Using Neural Networks. Computers Chem. Eng., 1992. 16: p. 379-391.
[26] Arters, D.C., E.A. Schiferl, and D.T. Daly, Variability of Intake Valve Deposit Measurements in the BMW Vehicle Intake Valve Deposit Test. SAE Technical Paper Series, 1997. SP-1277(971723): p. 67-80.
[27] Kirkpatrick, S., et al., Optimization by Simulated Annealing, Science, 1983, 220: p. 671-680.
[28] Marcoulaki, E.C. and Kokossis, A.C., Molecular Design Synthesis using Stochastic Optimisation as a Tool for Scoping and Screening, Computers Chem. Engng., 1998, 22(Supplement): p. S11-S1.

PART III: Computer Aided Product Design

The first two parts of the book focussed on the problem of computer aided molecular design (CAMD). The broader problem of design of new materials or products as against molecules is highlighted in this final part of the book. Some of the new frontiers for the computer aided product design (CAPD) problem are presented. Finally, the outstanding issues and challenges are discussed.


Chapter 16: Challenges and Opportunities for CAMD

R. Gani, L. E. K. Achenie & V. Venkatasubramanian

The problem of molecular design is only a special case of the much broader problem involving design of new materials, formulations or structured products as against simply new molecules (or mixtures of molecules). The expanded problem of computer-aided product design (CAPD) can thus be stated as the determination of the optimum material, structured product and/or formulation to meet a given set of design objectives.

The chapters in Parts I and II of this book have primarily been concerned with molecular (and mixture) design of small molecules. Chapters 5 and 13 have discussed the design of polymers, but they have concentrated on bulk properties and not on polymer properties that depend mainly on differences in the polymer structure at the mesoscopic and/or microscopic level. The methods and tools described in this book can, however, serve as the basis for solving problems not covered in detail in the earlier chapters. For example, using higher-order groups, topological indices and/or higher-level molecular structural representations, the methods described in chapters 2-7 can easily be adapted to design large, complex molecules that are usually isomers or multiple conformations of a specific molecular type. Use of higher-level molecular representations will also require the use of property estimation techniques that employ such molecular structural information.

The objective of this chapter is to highlight some of the challenges and opportunities related to problems not covered by the earlier chapters.

16.1 CHALLENGES

Klientjens (1999) provided a useful list of challenges in terms of structured material products that adapt their properties to suit their environment or that remember their previous shape. As examples, Klientjens lists some target functions (needs) for these structured materials: materials that contract like a muscle, materials that change color upon a change in thermodynamic conditions, materials whose viscosity changes when introduced into an electromagnetic field, and many more. Realization of these and other challenging products could be achieved by addressing (finding answers to) the following and other related questions:

• Can we manipulate the structures of our products at the micro-, meso- and/or macroscopic levels in order to give the product a desired functionality?

• Can we produce a desired chemical/biochemical/agrochemical product by finding the optimal reaction and processing path?

• Can we a priori identify the products for which an appropriate processing route is achievable?

• How can we validate and test the desired functional properties (such as controlled release) of the product?

• How can we enhance the functional properties of products?

• How can the optimal interactions between product and process design be explored?

• How can we identify the ingredients (additives) that, when added to a product (such as flavors, paints, pesticides, etc.), enhance the functional properties of the product and formulation?

16.1.1 From Molecules to Materials

Material design differs significantly from the more traditional design problem. In traditional design, the component behavior is often well known or can be described by relatively simple models. In the realm of material design, on the other hand, the primary challenge lies in determining the model of the material behavior. Furthermore, the multitude of possible chemical structures or formulations results in a combinatorially complex design space. Notwithstanding these challenges, present-day material design enjoys the advantage of rich pools of material data made available by the advent of high-throughput experimentation (HTE). Not only do large collections of data exist, but new data are also being generated at incredible rates. Consequently, another key issue in material design is the suitable extraction of knowledge from such vast reservoirs of data and its appropriate exploitation in the overall design exercise. At the same time, despite the high-throughput screening tools, the sheer complexity of the design space necessitates experimental design so as to intelligently focus the data collection process on the promising regions of the design space. What computer aided product design needs, then, is a rational framework that can address all the issues mentioned above and seamlessly integrate the processes of forward model development via knowledge extraction, solution of the inverse problem and design of experiments. Caruthers et al. (2002) discuss such a framework in the domain of catalyst design. The ideas presented by Caruthers and coworkers are applicable to the generic material design context, and we briefly relate them here.

Hybrid Modeling Approach

The framework integrates the computer-aided knowledge extraction process with HTE and expert knowledge so as to fully exploit HTE. It is important to note that the vast reservoirs of available data simply offer information and do not necessarily present it directly in the form of corresponding knowledge. In order to extract such knowledge from HTE data, the framework utilizes advanced models and novel software architectures that strive to approximate the thought process of the human expert. The overall material design problem is again viewed as consisting of two components, analogous to the molecular design problem described in the first chapter of the book. This is illustrated in Fig. 1.

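[Figure 1 diagram: in the forward problem, material composition and operating conditions pass through a predictive model to give material performance; in the inverse problem (design), material performance is mapped back to material composition.]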

Figure 1: Components of the material design problem

The forward model connects the material composition and/or high-level descriptors of the composition to the performance of the material in the application of interest. An inverse model relates the performance to the desired composition or formulation. By definition, design is the solution of the inverse model. Although the inverse solution is often the primary technological objective, a rational design process requires good, robust forward models. In turn, developing good forward models requires in-depth knowledge of the system of interest. However, the development of the model presents some unique challenges. It is often intractable to develop first-principles models alone that connect the material composition all the way to the material performance. At the same time, while a large and diverse data set is essential, purely data-driven models are also usually inadequate. These difficulties necessitate advanced hybrid modeling techniques in which first-principles models are used in concert with data-driven models, such as neural networks, and expert knowledge. Fig. 2, taken from Caruthers et al. (2002), schematically illustrates the concept of such a hybrid, integrated modeling framework.
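The hybrid idea can be illustrated with a minimal sketch, which is not the architecture of Caruthers et al. (2002). It assumes a first-principles part given by an Arrhenius expression and a data-driven part that is a straight line fitted to whatever the first-principles part leaves unexplained. The function names, the descriptor `x` and all numerical values are hypothetical.

```python
import math

R = 8.314  # gas constant, J/(mol K)

def first_principles_rate(T, A=1.0e6, Ea=50_000.0):
    """First-principles part: Arrhenius expression (A and Ea are illustrative)."""
    return A * math.exp(-Ea / (R * T))

def fit_residual(xs, residuals):
    """Data-driven part: least-squares line residual = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(residuals) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, residuals))
    b = sxr / sxx
    a = mean_r - b * mean_x
    return a, b

def build_hybrid(data):
    """data: list of (temperature, descriptor x, measured performance)."""
    xs = [x for _, x, _ in data]
    # The data-driven model only has to explain what physics leaves over.
    residuals = [y - first_principles_rate(T) for T, _, y in data]
    a, b = fit_residual(xs, residuals)

    def predict(T, x):
        return first_principles_rate(T) + a + b * x

    return predict

# Synthetic "measurements": physics plus a descriptor-dependent offset.
data = [(T, x, first_principles_rate(T) + 0.5 + 2.0 * x)
        for T, x in [(450.0, 0.1), (500.0, 0.4), (550.0, 0.7), (600.0, 1.0)]]
predict = build_hybrid(data)
```

In practice the linear residual model would be replaced by something richer, such as the neural networks mentioned above, but the division of labor between the two parts stays the same.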

Figure 2: Schematic of modeling architectures. (a) Traditional approach where models do not interact and (b) new hybrid approach where models work in concert.

The most complex material design problems are often those where the underlying systems involve reactions, have time-evolving performance properties of interest, or both. For instance, biological systems form an especially important class of such complex reactive systems, for which metabolic pathway modeling becomes important. To model these complex reactive systems, more sophisticated knowledge architectures are required. In general, for a chemical or biological system, a kinetic model is required to model the reactions. The parameters of the kinetic model then need to be determined from experimental data, assuming a particular reaction mechanism or metabolic pathway. To facilitate the process, the overall modeling effort is structured in two parts. First, a fundamental model of the system, based on physics, chemistry and/or biological knowledge about the system, is developed to provide suitable descriptors of the system. These descriptors are then used to determine the parameters of the kinetic model, which is the second part of the overall forward model. This two-part forward model approach is depicted in Fig. 3, which is a generic version of a figure from Caruthers et al. (2002). One may be tempted to eliminate the fundamental (chemistry, physics or biology) models as well as the kinetic models, and attempt to correlate descriptors of the system directly with performance. However, to develop a forward model reliable enough to be extrapolated to new regions of composition space for the overall material problem, it is often essential to utilize all available knowledge about the system at hand.
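A minimal sketch of the two-part structure follows. It assumes a single scalar descriptor, a hypothetical linear rule that maps the descriptor to an activation energy, and first-order kinetics for the performance curve. None of these choices comes from the source; they only show how the two parts chain together.

```python
import math

R = 8.314  # gas constant, J/(mol K)

def descriptors_to_kinetic_params(descriptor):
    """Part 1 (hypothetical rule): descriptor -> kinetic model parameters."""
    Ea = 80_000.0 - 20_000.0 * descriptor  # J/mol, illustrative only
    A = 1.0e8                              # 1/s, illustrative only
    return A, Ea

def performance_curve(A, Ea, T, times):
    """Part 2: first-order kinetics, conversion X(t) = 1 - exp(-k t)."""
    k = A * math.exp(-Ea / (R * T))
    return [1.0 - math.exp(-k * t) for t in times]

def forward_model(descriptor, T, times):
    """Material structure -> model parameters -> performance."""
    A, Ea = descriptors_to_kinetic_params(descriptor)
    return performance_curve(A, Ea, T, times)

times = [0.0, 0.1, 0.5, 1.0]
curve = forward_model(0.5, 500.0, times)
```

Because the descriptor enters only through the kinetic parameters, new knowledge about either part can be added without touching the other.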

[Figure 3 diagram. Blocks: Material Library; HTE data; Chemistry, Physics or Biology Models (with Rules); Kinetic Model (with Rules); Performance Curves. Stage labels: Material Structure to Model Parameters; Model Parameters to Performance.]

Figure 3: Schematic of the overall forward model for systems involving reactions

As shown in Fig. 3, the knowledge about the system captured by the fundamental model may be expressed as chemistry, physics or biology rules. These rules may themselves arise from first principles as well as from expert knowledge. The development of the fundamental model, as well as of the kinetic model, is entirely determined by the selection of the rules expected to govern the underlying chemical, physical or biological system at hand. In other words, the process of model development is actually the process of selecting the appropriate set of governing rules.

Knowledge Extraction: From Rules to Features

The overall forward model may now be defined as the clear and precise representation of all forms of knowledge about the system, including first-principles, data-driven and expert knowledge. If one wants the full benefits of HTE and the ability to do design, there is no alternative to model development. The reasons are clear. First, for most complex design problems, the composition space can be so large that even HTE cannot fully explore it. Second, in order to obtain more than just correlations using HTE, knowledge must be extracted, and the extraction of knowledge must be automated so as to keep up with the rate at which data is now becoming available. Finally, and probably most importantly, the systems being modeled are usually so complex that the number of ideas that must be addressed simultaneously often exceeds the capacity of human experts. Consequently, a computer-aided Knowledge Extraction (KE) engine with capabilities for both model refinement and formulation of new, critical high-throughput experiments is a necessary component for effective design. Fig. 4, again a generic version of a figure from Caruthers et al. (2002), shows the idea of knowledge extraction and the resulting flow of knowledge.

Knowledge extraction is not a model, but rather a process as explained below. To start with, a set of rules is postulated (possibly by a human expert) to best describe the system at hand. The rules lead to a fundamental model with chemistry, physics or biology knowledge embedded in it. The rules also lead to a kinetic model. Using HTE data, the model parameters are determined and the system performance predictions are obtained. Rather than the quantitative predictions, the qualitative features of the performance prediction vis-a-vis those in the HTE data are often more critical to evaluate the extent of the inadequacy of the postulated model. To handle the shortcomings of the model, the model refinement process is invoked which reselects a set of rules and the process repeats. At the same time, in order to better discriminate between different models or obtain data on different features about the system, new experiments are formulated. Thus, each iteration of the KE process leads to a potentially improved model as well as more discriminating HTE data. The continued interplay between theory and experiment via both a computer-based system and human experts ultimately results in generation of new knowledge. If the KE engine, HTE data and the human expert are working in concert, the process should ultimately begin to converge with each cycle of the process. At the end of the convergence process, the forward model would be essentially complete for the class of systems studied, so that it may then be used in conjunction with a suitable inverse solution method to solve the material design problem.
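The cycle described above can be sketched in a few lines. This is a simplified illustration, not the KE engine of Caruthers et al. (2002). It assumes two hypothetical rule sets (zero-order and first-order decay), compares predictions with data through a plain RMS error instead of qualitative features, and omits the formulation of new experiments.

```python
import math

def fit_zero_order(ts, cs, c0):
    """Rule set 1: c(t) = c0 - k t, with k from least squares."""
    k = sum(t * (c0 - c) for t, c in zip(ts, cs)) / sum(t * t for t in ts)
    return lambda t: c0 - k * t

def fit_first_order(ts, cs, c0):
    """Rule set 2: c(t) = c0 exp(-k t), with k from least squares on ln(c0/c)."""
    k = sum(t * math.log(c0 / c) for t, c in zip(ts, cs)) / sum(t * t for t in ts)
    return lambda t: c0 * math.exp(-k * t)

RULE_SETS = [("zero-order", fit_zero_order), ("first-order", fit_first_order)]

def rms_error(model, ts, cs):
    return math.sqrt(sum((model(t) - c) ** 2 for t, c in zip(ts, cs)) / len(ts))

def knowledge_extraction(ts, cs, c0, tol=1e-3):
    """Cycle through postulated rule sets until predictions match the data."""
    history = []
    for name, fit in RULE_SETS:         # model refinement: reselect the rules
        model = fit(ts, cs, c0)         # parameter estimation from the data
        err = rms_error(model, ts, cs)  # compare prediction with experiment
        history.append((name, err))
        if err < tol:                   # postulated model is adequate
            return name, model, history
    return None, None, history          # no rule set explains the data

# Synthetic data generated with first-order decay, k = 0.5.
ts = [0.0, 1.0, 2.0, 3.0, 4.0]
cs = [math.exp(-0.5 * t) for t in ts]
name, model, history = knowledge_extraction(ts, cs, c0=1.0)
```

Here the zero-order rule set is tried first, rejected because its error is too large, and replaced by the first-order rule set, which reproduces the data.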

[Figure 4 diagram. Blocks forming a cycle: Formulation of Experiments; High Throughput Experiments; Performance Curves; Model; Feature Extraction; Rules; Model Refinement.]

Figure 4: The Knowledge Extraction process

16.1.2 Selection of Reaction Schemes

The selection of the best reaction scheme (item 2 of the list of challenges) for the industrial production of chemicals is probably the most challenging unsolved problem. The development of processes requires the consideration of a wide variety of reactions that transform raw materials into desired products. The identification of the most efficient reaction routes connecting the raw materials and the desired products is known as reaction path synthesis. The need to discover and develop alternative or new reaction paths to obtain existing or new chemicals attracts more and more attention, owing to changing conditions in the availability of raw materials and energy, the constraints posed by ecological and health considerations and the shifting requirements of the market. The synthesis of new chemical reaction paths is an attractive objective within the general field of process synthesis. Reaction path synthesis, that is, the generation of a network of alternative routes for the manufacture of a desired product and the selection of an optimal route, represents a key step in process design. Significant advances have been made in the synthesis, analysis and evaluation of alternative reaction routes (Govind & Powers, 1981; Rotstein, Resasco & Stephanopoulos, 1982; Crabtree & El-Halwagi, 1994; Fornari & Stephanopoulos, 1994; Buxton, Livingstone & Pistikopoulos, 1997; Li, Hu, Li & Shen, 2000).

Usually, the synthesis problem falls into one of the following classes: (a) Given a desired product, identify the feasible sets of raw materials, as well as the pathways that correspond to feasible mechanisms - retrosynthesis; (b) Identify potential products starting from available raw materials using feasible mechanisms; (c) Bridge the gap between the given raw materials and products using feasible mechanisms; (d) Identify chemicals and feasible intermediate reaction steps to bridge a given situation with a desired situation, and the reverse - Solvay-type clusters of reactions. Any computer-aided synthesis procedure requires the solution of the following problems: (i) representation of chemical compounds and reactions; (ii) selection of the appropriate synthesis strategy; and (iii) criteria for evaluation.

The existing approaches to the identification of reaction paths can be divided into two categories: information-based systems with their roots in chemistry and logic-based systems with their roots in mathematics. The first approach is based on the chemist's point of view and uses vast amounts of data, which encode available knowledge on chemistry, generating only the rational alternatives. The second approach uses logical constructs of chemistry and mathematical representations of molecules and reactions; it can synthesize completely novel reaction pathways, but it also produces large numbers of unattractive solutions. In each approach there is a trade-off between the generality of the methods (the ability to represent many alternatives) and their predictive power (the ability to represent specific reactions in detail), according to the representation system employed.

The future of computer-aided reaction path synthesis systems depends on their ability to provide quantitative evaluation of the generated alternatives. Thus, the priority is to estimate the reactivity, equilibrium conversion, by-product formation etc.; for large-scale commodity products, the reaction path synthesis must be coupled with processing requirements. The key for any quantitative evaluation of the generated alternatives is the prediction of the amount of desired product that can be obtained from each reaction route. Therefore, for each feasible reaction (stoichiometric and thermodynamic), the equilibrium and kinetic conditions must be evaluated. For a limited number of reactions, it is possible to find equilibrium and kinetic data in the literature or to conduct experiments in order to evaluate them. Generally reaction kinetic data (rate constants) are determined experimentally and only knowledge-based systems include such data. However, the reaction path synthesis problem is likely to include hundreds or thousands of both known and new reactions, for which a rapid equilibrium and kinetic evaluation must be implemented. For small molecules and very simple (unimolecular) reactions, computational chemistry may help evaluate the kinetics.
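The equilibrium part of this evaluation rests on standard thermodynamic relations. They are written here in assumed notation, with $\nu_i$ the stoichiometric coefficient of species $i$ (positive for products, negative for reactants) and $\Delta G^{\circ}_{f,i}$ its standard Gibbs energy of formation:

$$\Delta G^{\circ}_{rxn} = \sum_i \nu_i \, \Delta G^{\circ}_{f,i}, \qquad K = \exp\left(-\frac{\Delta G^{\circ}_{rxn}}{RT}\right)$$

A reaction with a strongly positive $\Delta G^{\circ}_{rxn}$ has $K \ll 1$ and can be screened out at the given temperature without any kinetic information, which is why the equilibrium test is usually applied before the more expensive kinetic evaluation.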

For the generation of alternative reaction paths for the production of a given chemical product, a three-step methodology can be considered: (1) Co-material design step; (2) Reaction path synthesis step; (3) Post-synthesis step. In the first step, the chemical species included in the reaction network are generated using a group-based CAMD procedure (Gani et al., 1991; Constantinou et al., 1995). Careful pre-selection of these compounds provides an early opportunity to limit the size of the problem. Thus, starting from the desired product and based on its known chemistry, the appropriate set of groups is selected. The CAMD procedure systematically generates the stoichiometric co-material candidates, considering the specified type of compounds and property constraints. Also, to complete any stoichiometry, it may be necessary to include some simple additional molecules, which cannot be systematically designed using the CAMD procedure.

In order to develop the reaction path network using the computer-aided reaction synthesis algorithm, it is necessary to create a specific structure and reaction representation. In the second step (reaction path synthesis), these representations are constructed and the reaction tree is subsequently generated, considering role specifications, chemistry constraints and stoichiometric coefficient constraints. The identification of feasible stoichiometries is performed using a multi-level "Generate and Test" procedure (see Chapters 7 and 14), which includes the atom balance and the thermodynamic and equilibrium conversion constraints, in order to reduce the problem dimensions and to screen out the infeasible stoichiometries. Thermodynamic properties such as the Gibbs free energy of reaction serve as the basis for a selection strategy. The resulting reaction path network includes only the stoichiometrically and thermodynamically feasible reactions.
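A minimal generate-and-test sketch for stoichiometries is given below. It is not the procedure of Chapters 7 and 14: it simply enumerates small integer coefficient vectors over a hypothetical five-species set, keeps those that satisfy the atom balance and produce the target, and screens them by the sign of the standard Gibbs energy of reaction at 298.15 K. The Gibbs energies of formation are approximate textbook values for the gas phase and are included only for illustration.

```python
import math
from functools import reduce
from itertools import product

SPECIES = ["CO", "H2O", "CO2", "H2", "CH4"]
ATOMS = {  # atom counts per species, in the order (C, H, O)
    "CO": (1, 0, 1), "H2O": (0, 2, 1), "CO2": (1, 0, 2),
    "H2": (0, 2, 0), "CH4": (1, 4, 0),
}
# Approximate standard Gibbs energies of formation, kJ/mol, gas phase, 298 K.
DGF = {"CO": -137.2, "H2O": -228.6, "CO2": -394.4, "H2": 0.0, "CH4": -50.5}
R, T = 8.314e-3, 298.15  # kJ/(mol K), K

def atom_balanced(nu):
    """Test 1: every element is conserved (sum of nu_i * atoms_i is zero)."""
    return all(sum(n * ATOMS[s][e] for n, s in zip(nu, SPECIES)) == 0
               for e in range(3))

def delta_g(nu):
    """Standard Gibbs energy of reaction; nu > 0 for products."""
    return sum(n * DGF[s] for n, s in zip(nu, SPECIES))

def generate_and_test(target="H2", max_coeff=3):
    feasible = []
    idx = SPECIES.index(target)
    coeffs = range(-max_coeff, max_coeff + 1)
    for nu in product(coeffs, repeat=len(SPECIES)):      # generate
        if nu[idx] <= 0:
            continue                                     # target must be a product
        if reduce(math.gcd, (abs(n) for n in nu)) != 1:
            continue                                     # skip scaled duplicates
        if not atom_balanced(nu):
            continue                                     # test 1: atom balance
        dg = delta_g(nu)
        if dg >= 0.0:
            continue                                     # test 2: thermodynamics
        feasible.append((nu, dg, math.exp(-dg / (R * T))))
    return feasible

feasible = generate_and_test()
```

With these species and coefficient limits the only surviving stoichiometry is the water-gas shift reaction, CO + H2O -> CO2 + H2. Steam reforming of methane is atom balanced but is rejected at 298 K because its Gibbs energy of reaction is positive.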

In the third step (post-synthesis), the results from the reaction path synthesis step are analyzed in order to identify the most promising reaction routes in terms of economics, operability, reliability, environmental impact, etc. The analysis involves large amounts of information, such as reaction conditions and kinetics needed to perform process design, simulation and optimization. Various simplifications have been suggested, the most important of which is the hierarchical evaluation of reaction schemes, which progresses through different levels of required detail, e.g. evaluation of alternative reaction schemes at equilibrium conditions, kinetically controlled conditions, considering overall gross added-value economics, toxic raw materials or by-products, etc.

16.1.3 Drug Design

In a recent article, Garg and Achenie (2001) discussed the use of mathematical programming for designing drugs with desired properties. The mathematical programming formulation is solved to obtain optimal descriptor values, which are then employed in the Cerius2 modeling environment to infer the optimal lead candidates, in the sense that they exhibit both high selectivity and activity while ensuring low toxicity. Both linear and non-linear quantitative structure activity relationships (QSARs) were developed for use in the approach. The modeling approach was demonstrated for a class of non-classical antifolates for Pneumocystis carinii and Toxoplasma gondii dihydrofolate reductase. Some of the potential leads found in this study have biological properties similar to those reported in the open literature.

Background

Pneumonia and toxoplasmosis are the major causes of morbidity and mortality in AIDS patients (Vita et al., 1987). The opportunistic pathogens Pneumocystis carinii (pc) and Toxoplasma gondii (tg), respectively, cause these diseases via the dihydrofolate reductase (DHFR) enzyme in AIDS and other immunocompromised patients. Existing therapies based on present drugs are either too toxic or not very selective between human DHFR and pcDHFR (or tgDHFR) (Walzer et al., 1988 and Kovacs et al., 1988). Currently available antifolate therapies for pc and tg infections (namely, trimethoprim and pyrimethamine) are weak inhibitors of DHFR. On the other hand, trimetrexate and piritrexim, although 100-10,000 times more potent than trimethoprim and pyrimethamine, are unfortunately strong inhibitors of DHFR from mammalian sources (Gangjee et al., 1993, 1995, 1996a, and Rosowsky et al., 1994, 1995).

There has been a flurry of research activity reported in the open literature (Chio et al., 1991, Piper et al., 1996, Gangjee et al., 1996b, 1997, 1998) focused on the design of drugs that are selective, i.e. simultaneously active against pcDHFR and tgDHFR and relatively inactive against human DHFR. In these studies, the researcher typically takes an antifolate backbone, changes some of the functional groups, synthesizes the molecule, and performs bioassays to determine whether it is a potential lead candidate. This process is naturally very time consuming and expensive. Compounding the problem, there are hundreds, even millions, of possible molecules that could be screened through this approach. This is a Herculean task for any research group to accomplish in a reasonable amount of time. Garg and Achenie propose a computer aided molecular design approach instead.

Problem Formulation

A CAMD approach is formulated to identify potential leads that are selective, i.e. simultaneously active against pcDHFR and tgDHFR and relatively inactive against human DHFR. In the suggested approach, two sub-problems are considered, namely, a forward problem and an inverse problem. In the forward problem, models are developed to predict the selectivity and activity from molecular descriptors. In the inverse problem, the optimal values (based on selectivity and activity) of the molecular descriptors are determined and an appropriate molecular structure is inferred.

For the forward problem, several antifolates, each with different inhibitory activity characteristics, and with the general structure

[Structure diagram of the general antifolate backbone; not recoverable from the source. The variable positions are defined below.]

where W = [N, CH], X = [N, CH], Y = [N, CH2], Z = [N, CH2], R1 = [no substituent, H, Cl, CH3], R2 = [H, CH3] and R3 = [H, CH3, CHO, CH2CCH, CH(CH3)2, CH2CH3], are fed into the MSI Cerius2 modeling environment. The latter then gives a unique set of descriptor values corresponding to each molecule. Next, a quantitative structure activity relationship (QSAR) between the activities of the antifolate molecules and the descriptor values is developed. QSAR models are also developed for the selectivities of the antifolate molecules for pcDHFR (and tgDHFR) versus rlDHFR.

The inverse problem is solved using one of the QSAR models. This results in a set of optimal descriptor values for potential lead candidates with both high selectivity and activity values. The inverse problem is given by

$$\max_{d} \ \mathrm{Selectivity}(d) \quad \text{subject to} \quad \mathrm{Activity}(d) \ge \mathrm{Activity\_low}$$

where, d is the vector of descriptors. The Selectivity and Activity are models generated from the forward problem. Activity_low is the activity above which the drug has a significant biological effect. The suggested formulation is done bearing in mind the fact that the selectivity of a drug is more critical than its activity since the dosage and/or its form can control the activity of any drug. Note that the drug can be given at a given level of potency or frequency. The dosage form can be intravenous or oral.
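The structure of this inverse problem can be illustrated with a small sketch. It does not reproduce the QSAR models or the solver used by Garg and Achenie (2001): the two linear models, their coefficients, the two-descriptor vector and the descriptor grid are all hypothetical, and exhaustive grid enumeration stands in for a mathematical programming solver.

```python
from itertools import product

def selectivity(d):
    """Hypothetical linear QSAR for selectivity."""
    return 2.0 * d[0] - 1.0 * d[1]

def activity(d):
    """Hypothetical linear QSAR for activity."""
    return 1.0 * d[0] + 3.0 * d[1]

def solve_inverse(activity_low, grid):
    """Maximize selectivity(d) subject to activity(d) >= activity_low."""
    best = None
    for d in product(grid, repeat=2):
        if activity(d) < activity_low:
            continue  # violates the activity constraint
        if best is None or selectivity(d) > selectivity(best):
            best = d
    return best

GRID = [0.25 * i for i in range(9)]  # descriptor values 0.0, 0.25, ..., 2.0
d_opt = solve_inverse(4.0, GRID)
```

For these assumed models the optimum is d = (2.0, 0.75): the first descriptor sits at its upper bound, and the second is the smallest grid value that still satisfies the activity constraint. The optimal descriptor values would then be mapped back to substituents on the backbone, as described in the text.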

The mathematical programming model can handle objectives and constraints different from the suggested one above. The optimal descriptor values from the model are then used to identify the appropriate substituents on the antifolate backbone. In other words, these descriptor values are used to infer the important structural features necessary to attain the desired properties.

16.1.4 Formulations

The mixing of materials to achieve a new or improved product is practiced in many industries, including paints and dyes, foods, personal care, detergents, plastics and pharmaceutical development. Formulated intelligent industrial products use specifically chosen mechanisms to serve the customer by accurately exerting their desired features. According to Kind (2002), the desired features consist of

Performance: nutritional value, health care, disease prevention, body care, surface protection, crop protection, chemical activity, etc.

Convenience: controlled release of active substance at the location and instance of maximum effect and minimal environmental impact, ease of handling, ease of application, absence of unwanted side effects, etc.

Design of formulated products requires knowledge of the interaction between the microscopic structure and the process at the molecular and nano-scale on the one hand and the macroscopic consumer oriented properties of the product on the other hand. Formulation design problems such as mixing & blending of oils or solvents can be tackled by currently available CAMD techniques. For design of polymer formulations (or blends), ingredients for food, drugs, pesticide, etc. for desired delivery/penetration, however, integration of models of different scales of size, time and complexity with CAMD techniques is needed. This is a very rapidly growing research area and development of integrated computer aided methods & tools for formulation design is certainly feasible. Much work, however, is necessary to measure and collect the necessary data for the development of new property models.

16.2 CURRENT TRENDS TOWARDS PROBLEM SOLUTIONS

16.2.1 Flexible solution strategies

The step in the product design procedure dealing with the manufacture of the product is in most cases applied sequentially, after the completion of the first three steps (identify needs, design products and test products). For simple solvent design problems, simultaneous solution approaches for process-product design have recently been reported by Hostrup et al. (1999) and Linke and Kokossis (2001). Mathematical programming approaches incorporating product and process design, while attractive, first need to overcome the problem of property models. As pointed out by Gani (2001), the property models for product design may not be suitable for process design and vice versa. In addition, once a property model is selected for inclusion in the process model, the application range in terms of additional new mixtures (generated by the product design steps) is restricted, since for the generated molecules either the model parameters may be unavailable or the property model may not be suitable. Since, in mathematical programming techniques, changing the model equations (included as equality constraints) causes discontinuities in the solution trajectory, it may become extremely difficult to achieve convergence if multiple versions of models for the same properties are used. This problem may, however, be overcome using mixed integer mathematical programming formulations that use logic or binary variables to represent different models.
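One generic way to write such a formulation is sketched here in assumed notation; it is not taken from any of the cited works. Let $f_k(x)$ be the $k$-th candidate model for a property $\theta$, valid for $x^{L}_k \le x \le x^{U}_k$, and let $y_k$ be a binary variable that selects it:

$$\theta = \sum_{k=1}^{K} y_k \, f_k(x), \qquad \sum_{k=1}^{K} y_k = 1, \qquad y_k \in \{0, 1\}$$

$$x^{L}_k - M (1 - y_k) \le x \le x^{U}_k + M (1 - y_k), \qquad k = 1, \ldots, K$$

Exactly one model is active at a time, and the big-$M$ constraints (with $M$ a sufficiently large constant) force the operating point into the validity range of the active model. The switch between models is then handled by the integer part of the solver rather than appearing as a discontinuity in the continuous equations.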

Recently, Gani and Pistikopoulos (2002) and Eden et al. (2002) proposed solving process as well as product design problems as a series of reverse problems. Just as molecular design problems may be formulated as reverse property estimation problems, process design problems may be formulated as reverse simulation problems: determine the design targets from the process models, given the known process input information and the desired process output information. Eden et al. (2002) have shown that the solution of this reverse simulation problem does not require property models in the process model equations, since the unknown design targets are functions of the target properties. This means that the target properties can be determined from the reverse simulation problem by solving a set of (in most cases) linear equations, and the design targets are then calculated from them.

The advantage of this procedure is that solution complexity is reduced without sacrificing solution accuracy. Note also that the dependence on property models for performing mass- and energy-balance calculations has been eliminated from this step. In order to complete the process design, conditions of temperature, pressure and/or composition are determined in the next step, where a reverse property estimation problem is solved. Here, the property values (calculated in the first step) and the mixture compounds are known, but the variables defining the condition of operation (temperature, pressure and/or composition) are unknown. As long as the target properties (from the reverse simulation step) are matched to within some tolerance, any number of property models may be used for this reverse property estimation step. The hybrid CAMD methods are designed to handle multiple property models (see figures 18-19 in chapter 6) and have the flexibility to design the condition of operation if the compounds are known, and vice versa. Note that the reverse simulation problems may also be solved in order to "define needs" for the CAMD problem.
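The two reverse problems can be illustrated with a deliberately simple sketch. It is not the property-based formulation of Eden et al. (2002): it assumes a single mixer, a property that mixes linearly on a flow-weighted basis, and a hypothetical table of candidate solvents with made-up property values.

```python
def reverse_simulation(f_feed, p_feed, f_solvent, p_target_out):
    """Step 1: solve the linear mixing balance for the solvent property target.

    (f_feed + f_solvent) * p_out = f_feed * p_feed + f_solvent * p_solvent
    No property model is needed; the unknown is the property value itself.
    """
    return ((f_feed + f_solvent) * p_target_out - f_feed * p_feed) / f_solvent

def reverse_property_estimation(p_target, candidates, tol):
    """Step 2: find candidates whose property matches the target within tol."""
    return sorted(name for name, p in candidates.items()
                  if abs(p - p_target) <= tol)

# Hypothetical candidates and property values, for illustration only.
CANDIDATES = {"solvent_A": 1.2, "solvent_B": 1.95,
              "solvent_C": 2.05, "solvent_D": 2.8}

target = reverse_simulation(f_feed=60.0, p_feed=1.0,
                            f_solvent=40.0, p_target_out=1.4)
matches = reverse_property_estimation(target, CANDIDATES, tol=0.1)
```

The first step returns a target property value of 2.0 without reference to any compound, and the second step returns every candidate that meets it within the tolerance. Any property model, or several at once, could supply the candidate values in the second step.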

Integrated process-product design problems may also be tackled by decomposing the problem into two reverse problems and iterating on the connection between them until the optimal solution is reached (see figure 6). That is, the reverse simulation problem determines the target design values for a specified set of product qualities (chemical identities are not necessary, since property models are not used in the reverse simulation problem). The reverse property estimation problem is then solved by matching the design targets, generating new input/output stream parameters and determining the chemical identities. This information is fed back to the reverse simulation problem and a new iteration loop is started. Eden et al. (2002) provide an illustrative example of how the reverse problem formulation works for integrated process-product design.

Figure 6: Reverse problem formulations for integrated process-product design (Eden et al. 2002)

Another opportunity for developing flexible solution strategies is in the area of formulated products. Here, the design problem is to find a formulation that, when added to another product, enhances its function. Thus, the design of the formulation (commonly known as the active ingredient) and the testing of the final product need to be performed simultaneously. Examples include drug delivery, pesticide uptake, polymer blends for specific applications, inhibitors for drugs, elastomers in chemical products and many more. In all these product design problems, models of the process (phenomena) need to be combined with the search for property-based formulations. In many cases, one starts by modeling the diffusion process (for example, in drug delivery) and then relates the sensitive parameters of this process to the target properties of the desired active ingredients. The process model in this case is represented by a system of partial differential-algebraic equations in which a number of the terms are represented by properties such as diffusion, thermal conductivity, viscosity, etc., for which property models (or constitutive models) need to be introduced. The difficulty in solving these problems in a general manner is that the property model may be valid only for a certain range of conditions and/or mixtures. The other difficulty is that the models may be complex, requiring higher levels of molecular structural information. Therefore, multi-level and multi-scale modeling approaches need to be considered together with an optimization strategy to identify the best formulation. Again, a decomposition of the problem into reverse problems may be a more pragmatic and flexible way to solve these product design problems.

16.3 ENABLING TECHNOLOGIES

The Journal of Computer-Aided Molecular Design started in 2001 had an initial emphasis on drug discovery and design. To get a sense of current trends in CAMD (at least from a drug discovery point of view), one can look at the areas in which the journal actively solicits manuscripts: theoretical chemistry; computational chemistry; computer and molecular graphics; molecular modeling; protein engineering; drug design; expert systems; general structure-property relationships; molecular dynamics; and chemical database development and usage. Researchers in CAMD (as defined in this book) are making contributions in all of the above areas except theoretical chemistry and computer and molecular graphics. Some of these areas are discussed further below.

In his lecture, Tomasi (1999) quoted Dirac (1929) as follows:

The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble. It therefore becomes desirable that approximate practical methods of applying quantum mechanics should be developed which can lead to an explanation of the main features of complex atomic systems without too much computation.

Developers of computational chemistry software (such as Gaussian, Jaguar, GAMESS) have largely heeded the advice given in the second sentence, although the amount of computation is still too large. In computational chemistry, sophisticated algorithms from numerical mathematics are employed to elucidate molecular properties. In addition, fast computers (for example parallel/distributed computers, networked computers, and agent-based computing) are employed. If the view expressed in the opening sentence were correct (most computational chemists agree with this view), then expected future advances in both hardware (faster and cheaper computers) and computational chemistry software would revolutionize property modeling at various scales. Since the availability, accuracy and speed of computation of property models are the "Achilles' heel" of CAMD, the effect on CAMD would be very profound indeed.

Current computational chemistry algorithms have had reasonable successes. For example, Sandia National Labs (Albuquerque, NM, http://www.bmpcoe.org/bestpractices/internal/sandi/sandi 57.html) have reported the use of quantum chemistry modeling to determine the structure and energetics of a newly discovered fullerene (an allotrope of carbon). Nyden and Brown (1993) report using molecular dynamic simulations and cone calorimeter measurements to gauge the effects of electron beam irradiation and heat treatments on the flammability of the honeycomb composites used in certain parts of commercial aircraft.

The interactions between a protein and substrates dictate essentially all functions of an organism (http://www.ram.org/research/pfp.html). Protein-substrate and protein-protein interactions depend critically on the 3D structure of the protein. This 3D structure results from the folding of the protein, a problem that scientists have spent well over 30 years trying to understand. Understanding and modeling protein folding (i.e. activity) has enormous implications for drug discovery, since in principle the 3D structure of a protein may then be controlled (i.e. designed) in order to dock properly to a given substrate or protein. Modeling and simulating protein folding is rather complex and computer intensive. The protein-folding problem can benefit from recent advances in computing and algorithms, including global optimization (see, for example, Floudas et al. (1999)). Likewise, property modeling and prediction in CAMD can benefit greatly from global optimization algorithms under certain conditions: the algorithms should be easy to use, intelligent enough to recognize quickly whether or not a solution exists, robust enough to converge easily to a solution if one exists, and, most importantly, memory usage and computations should scale almost linearly with the problem size.

In the chemical database area there is a concentrated effort to archive experimental data and property prediction methods in an intelligent database that allows easy retrieval of pertinent information. The CAPEC database (http://www.capec.kt.dtu.dk/main/software/database) at the Technical University of Denmark has a large collection of experimental data, which can be used for developing new models or verifying the accuracy of existing predictive models with respect to application in CAMD. The database from Cranium (Molecular Knowledge Systems, Inc., http://www.molknow.com/) provides physical property estimation. In addition, the database available in SMSWIN (see chapter 9) has a large collection of compound-related data. The PARIS-II project (Cabezas, 2000) offers solvent selection tools based on database search and on-line calculation of properties. The database area would benefit greatly from advances in data mining, knowledge extraction (see section 16.1.1) and protocols such as XML. As computational chemistry becomes more accurate for both small and large molecules, gaps (absence of experimental data) in property databases can be partially offset by computational chemistry data.
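The gap-filling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PropertyRecord` type, the `lookup` function and the dictionary-based store are hypothetical and do not correspond to the interface of any database named in this section.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class PropertyRecord:
    value: float
    source: str  # "experimental" or "computed"


# Hypothetical in-memory store keyed by (compound, property name).
ExperimentalDB = Dict[Tuple[str, str], float]


def lookup(
    db: ExperimentalDB,
    compound: str,
    prop: str,
    estimator: Optional[Callable[[str, str], float]] = None,
) -> Optional[PropertyRecord]:
    """Return the experimental value if present; otherwise fall back to a
    computed estimate (if an estimator is supplied) and label its origin."""
    key = (compound, prop)
    if key in db:
        return PropertyRecord(db[key], "experimental")
    if estimator is not None:
        return PropertyRecord(estimator(compound, prop), "computed")
    return None  # a genuine gap: no data and no estimator
```

Keeping the `source` label on every record is the design point: a computed value can fill a gap, but a downstream user can still tell it apart from a measurement.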

16.4 CONCLUSIONS

We conclude the book with a note on the key improvements required in the product design (inverse) problem strategies to meet the product needs of the future. Before that, we will briefly summarize some of the aspects of the CAMD problem that were addressed in the book. The book primarily focused on the computer aided molecular design problem and highlighted its key issues. A background was provided of the required forward modeling effort ranging right from linear group-contribution models to hybrid approaches based on complex, knowledge-extraction architectures that strive to integrate first-principles, expert and data-driven knowledge. In terms of the inverse or reverse problem, a variety of methods were discussed in detail including generate-and-test methods, mathematical programming, evolutionary algorithms and hybrid models. The case studies presented in the book highlighted the application and practice of some of these methods.
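As a reminder of how the simplest of these pieces fit together, the following minimal Python sketch combines a linear group-contribution forward model with a generate-and-test inverse step. The group set, the contribution values and the base constant are placeholders chosen for illustration (they are not fitted parameters of any published method), and the feasibility test is only the valence balance for acyclic molecules.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Hypothetical group set: name -> (free attachments, illustrative contribution).
GROUPS = {"CH3": (1, 20.0), "CH2": (2, 25.0), "OH": (1, 90.0)}
BASE = 200.0  # placeholder constant of the linear model


def predict(group_vector):
    """Forward problem: linear estimate, base + sum(n_j * c_j)."""
    return BASE + sum(n * GROUPS[g][1] for g, n in group_vector.items())


def is_acyclic_feasible(group_vector):
    """Valence balance for an acyclic molecule: sum(n_j * (2 - v_j)) == 2."""
    return sum(n * (2 - GROUPS[g][0]) for g, n in group_vector.items()) == 2


def generate_and_test(max_groups, lo, hi):
    """Inverse problem: enumerate group vectors and keep the feasible ones
    whose estimated property lies in the target interval [lo, hi]."""
    hits = []
    for size in range(2, max_groups + 1):
        for combo in combinations_with_replacement(sorted(GROUPS), size):
            gv = Counter(combo)
            if is_acyclic_feasible(gv):
                value = predict(gv)
                if lo <= value <= hi:
                    hits.append((dict(gv), value))
    return hits
```

With these placeholder numbers, `generate_and_test(3, 300, 340)` keeps two group vectors, CH3 + OH and CH3 + CH2 + OH. The enumeration grows combinatorially with `max_groups`, which is why the more elaborate methods discussed in the book are needed for realistic problems.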

16.4.1 Advanced Product Design Strategies

Much of the current work in product design is carried out through empirical, trial-and-error approaches involving time-consuming experiments. It is important to capture the knowledge gained from past experiments and apply it in a systematic manner so that future efforts will need fewer trials and therefore fewer experiments. In this context, a major effort is needed to understand the molecular structure-property relationships, collect experimental data, develop mathematical models, and apply the solution techniques to identify/design new products and processing routes.

Recent successful applications of CAMD to the development of new agrochemicals, materials, and pharmaceuticals can be found in the book edited by Reynolds et al. (1995). Most of these successes employ techniques such as CoMFA, molecular dynamics, de novo ligand design, QSAR, molecular orbital methods, and genetic algorithms. In these applications important properties include interfacial phenomena and pharmacokinetic properties such as transport and metabolism.

This chapter introduced the broader problem of material or product design. It provided a flavor of the kind of modeling effort that will be required when the system at hand or its performance measure is too complex to be modeled by simple property prediction methods. In such cases, the forward model itself will be a complicated and computationally intensive process. With the forward model presenting the computational bottleneck, the inverse problem solution strategy will itself also need to be modified. Typical inverse methods discussed in this book, such as conventional mathematical programming or simple genetic algorithms, will no longer be able to solve the inverse problem in a reasonable amount of time under a computationally intensive forward model. These search algorithms will have to be redesigned so that all knowledge (fundamental or expert) about the underlying system is suitably exploited to obtain a guided search procedure capable of exploring the search space rapidly despite the computational limitations posed by the forward model. As of now, only some highly preliminary efforts exist towards this end and the problem is far from solved. However, this is the key challenge that will ultimately have to be dealt with to produce an intelligent and efficient, knowledge-driven design system capable of handling the complex material design problems in the years to come.
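One very simple way to picture such a guided search is a two-stage screen: a cheap, knowledge-based score ranks the candidates, and the expensive forward model is called only on the few that rank best. The Python sketch below illustrates that idea under stated assumptions; the function name `guided_search` and its arguments are hypothetical, and this is not the method of any chapter in this book.

```python
from typing import Callable, Iterable, Tuple, TypeVar

C = TypeVar("C")


def guided_search(
    candidates: Iterable[C],
    cheap_score: Callable[[C], float],
    expensive_model: Callable[[C], float],
    budget: int,
) -> Tuple[C, float, int]:
    """Rank candidates by a cheap knowledge-based score, then spend at most
    `budget` calls of the expensive forward model on the top-ranked ones.

    Returns (best candidate, its forward-model value, number of calls made).
    """
    if budget < 1:
        raise ValueError("budget must allow at least one forward-model call")
    ranked = sorted(candidates, key=cheap_score, reverse=True)
    evaluated = [(c, expensive_model(c)) for c in ranked[:budget]]
    if not evaluated:
        raise ValueError("no candidates to evaluate")
    best, value = max(evaluated, key=lambda pair: pair[1])
    return best, value, len(evaluated)
```

The quality of the result depends entirely on how well the cheap score correlates with the forward model; a poor score simply spends the budget in the wrong region of the search space.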

16.3.2 Multidisciplinary Approach

Since chemical product design problems are multidisciplinary in nature, development of a systematic framework based on identified workflow and data-flow for the various inter-related activities would make a significant contribution. The framework needs to consider the human-computer interactions and allow the human to control the workflow while the computer performs tasks that are calculation intensive in the workflow and most of the tasks in the data-flow. In this way, the human concentrates on the tasks he/she can efficiently solve while the computer concentrates on the tasks it can perform very efficiently. The systematic framework could serve as the basis for state-of-the-art computer-aided tools utilizing existing databases, mathematical models and efficient solution techniques. Note that while the computer-aided tools will depend on the availability of appropriate models, the systematic framework can be used even if the models are not available.

16.4 REFERENCES

1. A. Buxton, A.G. Livingston and E.N. Pistikopoulos. Reaction Path Synthesis for Environmental Impact Minimization. Computers Chem. Engng. 21, S959-S964 (1997)

2. J.M. Caruthers, J. A. Lauterbach, K. T. Thomson, V. Venkatasubramanian, C. M. Snively, A. Bhan, S. Katare and G. Orkarsdottir, J. Catalysis, (2002), submitted for publication.

3. L.-C. Chio and S. F. Queener, "Identification of highly potent and selective inhibitors of Toxoplasma gondii dihydrofolate reductase." Antimicrob. Agents Chemother. 37 (1991) 1914-1923.

4. E.W. Crabtree and M.M. El-Halwagi. Synthesis of Environmentally Acceptable Reactions. AIChE Symposium Series, Volume on Pollution Prevention via Process and Product Modifications 90, 117-127 (1994)

5. V.T. De Vita Jr., S. Broder, A. S. Fauci, J. A. Kovacs, B. A. Chabner, Ann. Intern. Med. 106 (1987) 568-581.

6. P. A. M. Dirac, 'Quantum mechanics of many-electron systems', Proceedings of the Royal Society (London), A 123 (1929) 714-733.

7. M.R. Eden, S. B. Jorgensen, R. Gani, M. El-Halwagi, "Property integration - A new approach for simultaneous solution of process and molecular design problems", Computer Aided Chemical Engineering, J. Grievink and J. van Schijndel (Editors), Vol. 10 (2002) 79-84.

8. C.A. Floudas, J.L. Klepeis and P.M. Pardalos, "Global Optimization Approaches in Protein Folding and Peptide Docking", DIMACS Series in Discrete Mathematics and Theoretical Computer Science, (Ed. F. Roberts), 47 (1999) 141-171.

9. T. Fornari and G. Stephanopoulos. Synthesis of Chemical Reaction Paths: The Scope of Group Contribution Methods. Chemical Engineering Communications 129, 135-157 (1994)

10. A. Gangjee, A. P. Vidwans, A. Vasudevan, S. F. Queener, R. L. Kisliuk, V. Cody, R. Li, N. Galitsky, J. R. Luft, and W. Pangborn, "Structure-based design and synthesis of lipophilic 2,4-diamino-6-substituted quinazolines and their evaluation as inhibitors of dihydrofolate reductases and potential antitumor agents" J. Med. Chem. 41 (1998) 3426-3434.

11. A. Gangjee, A. Vasudevan, S. F. Queener, and R. L. Kisliuk, "2,4-Diamino-5-deaza-6-substituted pyrido[2,3-d]pyrimidine antifolates as potent and selective nonclassical inhibitors of dihydrofolate reductases" J. Med. Chem. 39 (1996a) 1438-1446.

12. A. Gangjee, A. Vasudevan, S. F. Queener, and R. L. Kisliuk, "6-Substituted 2,4-diamino-5-methylpyrido[2,3-d]pyrimidines as inhibitors of dihydrofolate reductases from Pneumocystis carinii and Toxoplasma gondii and as antitumor agents" J. Med. Chem. 38 (1995) 1778-1785.

13. A. Gangjee, J. Shi, S. F. Queener, L. R. Barrows, and R. L. Kisliuk, "Synthesis of 5-methyl-5-deaza nonclassical antifolates as inhibitors of dihydrofolate reductases and as potential antipneumocystis, antitoxoplasma, and antitumor agents" J. Med. Chem. 36 (1993) 3437-3443.

14. A. Gangjee, R. Devraj, and S. F. Queener, "Synthesis and dihydrofolate reductase inhibitory activities of 2,4-diamino-5-deaza and 2,4-diamino-5,10-dideaza lipophilic antifolates" J. Med. Chem. 40 (1997) 470-478.

15. A. Gangjee, Y. Zhu, S. F. Queener, P. Francom, A. D. Broom, "Nonclassical 2,4-Diamino-8-deazafolate analogues as inhibitors of dihydrofolate reductases from rat liver, Pneumocystis carinii, and Toxoplasma gondii." J. Med. Chem. 39 (1996b) 1836-1845.

16. R. Gani, "Computer aided process/product synthesis and design: Issues, needs and solution approaches", paper 264a, AIChE Annual Meeting, Reno, USA, Nov. 4-9, 2001.

17. R. Gani, E. N. Pistikopoulos, "Property modelling and simulation for product and process design", Fluid Phase Equilibria, 194-197 (2002) 43-59.

18. S. Garg, and L.E.K. Achenie, "Mathematical Programming Assisted Drug Design for Non-classical Antifolates," Biotechnology Progress, 17 (2001) 412-418.

19. R. Govind, and G.J. Powers. Studies in Reaction Path Synthesis. AIChE J. 27(3), 429-442 (1981)

20. M. Hostrup, P. M. Harper, R. Gani, 'Design of Environmentally Benign Processes: Integration of Solvent Design and Process Synthesis', Computers and Chemical Engineering, 23 (1999) 1394-1405.

21. M. Kind, personal communication, University of Stuttgart, Germany (2002).

22. L. Klientjens, "Thermodynamics of organic materials. A challenge for the coming decades", Fluid Phase Equilibria, 158-160 (1999) 113-121.

23. J.A. Kovacs, C. A. Allegra, J. C. Swan, J.C. Drake, J.E. Parrillo, B. A. Chabner, and H. Masur, "Potent antipneumocystis and antitoxoplasma activities of piritrexim, a lipid-soluble antifolate" Antimicrob. Agents Chemother. 32 (1988) 430-433.

24. M. Li, S. Hu, Y. Li and J. Shen. Reaction Path Synthesis for a Mass Closed-Cycle System. Computers Chem. Engng. 24, 1215-1221 (2000)

25. P. Linke, A. Kokossis, "Simultaneous synthesis and design of novel chemicals and chemical process flowsheets", Computer Aided Chemical Engineering, J. Grievink and J. van Schijndel (Editors), Vol. 10 (2002) 115-120.

26. M. R. Nyden and J. E. Brown, "Computer-Aided Molecular Design of Fire Resistant Aircraft Materials", Federal Aviation Administration (FAA), International Conference for the Promotion of Advanced Fire Resistant Aircraft Interior Materials, February 9-11, 1993, Atlantic City, NJ, pp. 147-158 (1993).

27. J.R. Piper, C.A. Johnson, C.A. Krauth, R. L. Carter, C. A. Hosmer, S. F. Queener, S.E. Borotz, and E. R. Pfefferkorn, "Lipophilic antifolates as agents against opportunistic infections. 1. Agents superior to Trimetrexate and Piritrexim against Toxoplasma gondii and Pneumocystis carinii in in vitro evaluations" J. Med. Chem. 39 (1996) 1271-1280.

28. C.H. Reynolds, "Computer-Aided Molecular Design Applications in Agrochemicals, Materials, and Pharmaceuticals", edited by C. H. Reynolds, M. K. Holloway, and H. K. Cox, ACS Symposium Series 589, ACS, Washington DC, (1995) 396-414.

29. A. Rosowsky, C.E. Mota, J.E. Wright, and S.F. Queener, "2,4-Diamino-5-chloroquinazoline analogues of trimetrexate and piritrexim: Synthesis and antifolate activity" J. Med. Chem. 37 (1994) 4522-4528.

30. A. Rosowsky, R. A. Forsch, and S.F. Queener, "2,4-Diaminopyrido[3,2-d]pyrimidine inhibitors of dihydrofolate reductase from Pneumocystis carinii and Toxoplasma gondii" J. Med. Chem. 38 (1995) 2615-2620.

31. E. Rotstein, D. Resasco and G. Stephanopoulos. Studies on the Synthesis of Chemical Reaction Paths - I. Chemical Engineering Science 37(9), 1337-1352 (1982)

32. J. Tomasi, "Towards 'chemical congruence' of the models in theoretical chemistry", HYLE - An International Journal for the Philosophy of Chemistry, 5 (1999) 79-115.

33. P.D. Walzer, C.K. Kim, J.M. Foy, M.T. Cushion, "Inhibitors of folic acid synthesis in the treatment of experimental Pneumocystis carinii pneumonia" Antimicrob. Agents Chemother. 32 (1988) 96.

Glossary of Terms

ABS: Alkyl Benzene Sulfonates
API: Active Pharmaceutical Ingredient
Aprotic Solvent: A term used to describe solvents of both high and low polarity which do not readily give a proton to a base
Basis set: The set of groups from which a molecule may be assembled.
BB: Branch and bound method. It is a strategy for obtaining the global minimum (or maximum) of a mathematical program that has discrete variables (sometimes in addition to continuous variables).
Binary variable: Has a value of either 0 or 1
Building block: The pieces the molecular models are assembled from in CAMD - a group or fragment
CAMbD: Computer aided mixture/blend design
CAMD: Computer Aided Molecular Design - the generation of molecules from fragments using a computerised technique
CAMD design algorithm: The set of sub-algorithms used to solve a CAMD problem
CAMD framework: The overall collection of algorithms for formulating, solving & analysing the solution results of CAMD problems and the sequence they must be applied in.
CAMD problem: The task of generating compounds matching a set of properties
CAMD solution: A solution to a CAMD problem
CAMD solution step: The general procedure of solving CAMD problems - in the developed framework this is performed in the design phase
CAMS: Computer Aided Molecular Search - identification of compounds having specific properties by systematic searching in databases
Candidate selection: Finding the most promising candidates among the solutions from the design phase
CAPD: Computer aided product design (CAMD + CAMbD)
Cardinality: Number of members
CFC: Chlorofluorocarbon
Chem3D: Commercial molecular modeling software
Chemometrics: The simulation of reaction systems with kinetic models and principal factor analysis to identify the major pathways
CHRIS: Database on the internet (see chapter 1)
Comaterial: Stoichiometric by-product
Computational load: The amount of calculations required to solve a CAMD problem
Connectivity: How the atoms of a compound are interconnected
CPLEX: MILP solver (www.cplex.com)
CSTR: Continuous Stirred Tank Reactor
CTAM: Critical Air Mass
Database lookup: Performing searches in a database for records with specific data
Descriptors (structural): Numbers or other information describing something about the structure of a molecule - a group vector is a set of structural descriptors
Design considerations: The aspects taken into account when formulating a CAMD problem - explicit considerations must be reformulated as constraints in order to apply the CAMD algorithm while implicit considerations are treated by restricting the types of molecules generated or by analysing the results in the post-design phase
Constraint specification: Expressing the problem formulation as constraints
Design constraints: The requirements a compound should fulfill in terms of properties
Design phase: The part of the CAMD framework responsible for solving a CAMD problem by using the CAMD design algorithm
Design specifications: The same as constraints
Desirable property: A constraint on a property controlling the suitability of a compound. Typically a relative specification such as "as high as possible", "as low as possible" or "as close to a goal value as possible"
Desirable qualities: Qualities that would enhance the suitability of a compound
DIC: Diisopropyl carbodiimide
DICOPT: Discrete and Continuous Optimizer (this solver is also available in GAMS)
Dimensionality: The dimensionality of a molecular model is a measurement for the level of detail contained in it
DIU: Diisopropyl Urea
DMAC: Dimethyl Acetamide
DMF: Dimethyl Formamide
EH&S: Environment, Health and Safety

EH&S properties: Constraints relating to the Environment, Health & Safety of an operation
Environmental impact: The consequences of discharging a compound into the environment
Essential property: A constraint indicating a property value or interval required in order for a compound to be used for a particular application
Essential qualities: Qualities that a compound must possess in order to be usable
Estimated properties: Properties estimated using a property prediction method
External substance: Compound not participating as a reactant or product in a process
Feasibility: Whether a compound can be expected to exist in nature or be synthesized successfully
Feasibility requirements: The requirements a molecular model must fulfill in order to be regarded as being a feasible compound
FMS: Final molecular structures
Formulation: CAPD problems that refer to mixture/blend design and/or addition of additives to a product in order to enhance the product performance or quality (see chapters 11 and 16)
Forward problem: Property prediction
Fragment: A fragment or sub-part of a molecule, typically the same as a group - in CAMD fragments are used as building blocks
Free-attachment: A group has a free attachment if it is available for bonding to another group
Functional group: A group defining the family of the compound it appears in (e.g. OH defines a compound as being an alcohol).
GA: Genetic Algorithm
GAMS: General Algebraic Modeling System (commercial software)
GCA: Group Contribution Approach - property prediction based on the assumption that a fragment has the same contribution to a property regardless of the compound it is found in
GC-EOS: Group contribution equation of state
Generate and Test: A combinatorial approach
Generation level: The level of detail in the generated molecule models
Global optimization: Identification of the absolute minimum (or maximum) point within the range of allowed values of the design variables and the region defined by the constraints.

Group: A clearly defined substructure of a molecule, is part of a group-set and forms the basis for the GCA prediction methods
Group classification: The subdivision of the groups from a group-set into classes and categories
Group vector: A collection of groups (taken from a group-set) that defines a compound. Each group appears in the vector the number of times it can be found in the compound
Group-set: A set of molecular fragments used to describe compounds. An example is the groups defined in the UNIFAC method
Hetero atom: Non-carbon atom in (aromatic) ring.
Hildebrand (solubility) Parameter: An indicator for solubility/miscibility
HMPA: Hexamethyl Phosphoramide
HSDB: Database on the internet (see chapter 1)
Hybridization: The bond configuration for an atom
Level: The generation steps taken in the design phase
ICAS: Integrated Computer Aided System that contains ProCAMD as one of the tools
IMS: Intermediate molecular structures
IVD: Intake Valve Deposits
LCA: Life Cycle Assessment
LIBRA: Interval arithmetic based global optimization package
Log P: Octanol-water partition coefficient
Mathematical programming: A mathematical model consisting of a performance objective, constraints (including material balances and other process constraints) and design variables that can be manipulated to optimize the performance objective.
MC: Main-chain (see chapters 5 & 13)
MEIM: Method for Environmental Impact Minimization
Metha groups: Groups with the same combination properties (see Chapter 2)
MILP: Mixed Integer Linear Programming - solving linear optimisation problems where some variables must have an integer value
MINLP: Mixed integer nonlinear program. This is a mathematical program involving both discrete and continuous design variables.
MM2: A molecular mechanics method for doing calculations on 3D molecule models
MOLDES: CAMD software (see chapter 2)
Molecular detail: The amount of structural detail embedded in a molecular model
Molecular model: An electronic representation/model of a molecular structure
Molecular modeling: Calculations of compound structures and properties using 3D molecular models
Molecule representation: The structure of a molecular model
MOPAC: A computer program for doing ab initio calculations on molecules. Available in many versions but commercially sold by Fujitsu Inc.
MSA: Mass separating agent
MTBE: Methyl Tertiary Butyl Ether
MW: Molecular weight
NFA: Number of free attachments
NN: Neural Network
Nonconvex equation: Does not have a unique minimum point
Nonlinear equation: Variables appear with indices other than 1.
ODP: Ozone Depletion Potential
Optimal solution: In MINLP based CAMD solution algorithms the compound having the optimum value of the objective function. In the developed framework the compound most suited for the intended use
OSL: MILP solver (www.research.ibm.com/osl)
PARIS-II: A software developed by US-EPA for solvent selection
PEL: Permissible Exposure Limit
Performance criteria: A measurement of how well a compound performs a given task
PET: Polyethylene Terephthalate
PFR: Plug Flow Reactor
PLS: Partial Least Squares
Post-design phase: The part of the CAMD framework responsible for analysing the results obtained from the design phase
Pre-design phase: The part of the CAMD framework responsible for formulating a CAMD problem
Primary properties: Properties predicted purely on the basis of the molecular structure
Problem (CAMD) formulation: Identifying the goals of the design process
ProCAMD: Software developed at CAPEC based on the hybrid CAMD method
Properties: The physical and chemical properties of a compound (e.g. boiling point, melting point etc.)
Property intervals: The interval in which a property value must lie in order to be suitable for a given application
Property level: How complex the calculation of a given property with a given method is
Property prediction for CAMD - level 1: The prediction of properties based on molecular structural information
Property range: The total set of properties that has to be calculated in order to evaluate all the design constraints
Property trust: A qualitative measure for the quality of an estimation
ProPred: Pure component property estimation package
PVP: Poly(vinylidene propylene) copolymer
QSAR: Quantitative Structure Activity Relationships - Property prediction techniques relating an activity (toxicity, biodegradability, bioaccumulation) to the molecular structure
QSPR: Quantitative Structure Property Relationships - Property prediction based on the assumption that a property is related to the molecular structure of the compound. Related to GCA methods
Qualities: A qualitatively defined behavior or capability - like "liquid at ambient temperature" or "good solvent for phenol"
Reverse problem: CAMD could be regarded as the reverse of property prediction
RTECS: Database on the internet (see chapter 1)
SC: Side-chain (see chapters 5 & 13)
Secondary properties: Properties predicted using primary properties and/or temperature and pressure
SEVIN: The trade name for 1-naphthalenyl methyl carbamate
SMS: Solvent Molecular Structure
SMSWIN: Software developed at Syngenta, which is useful for solvent selection (see chapter 9)
SOLV-DB: Solvent Database (see chapter 11)
Solvent: A solvent is that constituent of a solution that is liquid in the pure state, is usually present in the larger amount and has dissolved the other constituent (a solute) of the solution. The solute may be a solid, a liquid or a gas. The solvent may be a single compound or a mixture of compounds
Spanning tree: A tree in a graph including all vertices
SQP: Sequential (or successive) Quadratic Programming
Steric information: Information regarding the (relative) spatial positions of atoms (to each other)
Structural feasibility: An implementation of the octet rule constraint
Structure (molecular): The internal organization of atoms and bonds that form a molecule - represented in calculations by a molecular model
Substructure searches: The identification of fragments in a molecule model by use of an algorithm
Target property in CAMD: A physical property whose value needs to be within a given range for the candidate molecule (product)
Uncertainty (properties): The inverse of Property Trust
Undesirable candidates: Candidates not fulfilling the requirements regarding feasibility or properties
Undesirable qualities: Qualities not desired in a candidate compound
UNIFAC: Group contribution based model for predicting the liquid phase activity coefficients of compounds present in a mixture
UNIQUAC: Model for estimation of liquid phase activity coefficients (requires information on molecule-molecule interaction as opposed to group-group interaction)
UPBD: Upper bound of objective function
US-EPA: United States - Environmental Protection Agency
VOC: Volatile Organic Compound
WAR: Waste Reduction algorithm

Subject Index

Acetic Acid Production Routes - Carbaryl Example, 321
Adaptation of Genetic Operators - Polymer Design, 114
Additional Structural Restrictions, 177
ADOL-C, 272
Advanced Product Design Strategy, 373
Analysis of Design Solutions, 156
Application Example - Optimization in CAMD, 55
Application Example - Problem Description, 84
Atom Balance, 180
Basic Set - Design of Aqueous Blanket Wash Blends, 279
Branch-and-Bound Algorithm Preliminaries, 46
Calculation of Properties in Level 2, 149
CAMD Algorithm, 14
CAMD Framework, 19
CAMD Phase - Extraction Solvent Replacement, 219
CAMD Phase - Mass Separating Agent, 225
CAMD Problem Formulation - Fuel Additives, 334
CAMD Problem Specification, 130
CAPD, 262
Carbaryl Production Routes, 199
Carbon Structure Constraints, 184
Case Studies - GA based CAMD, 117
Case Study CAMD_1 - Optimal Solvent Design, 250
Case Study CAMD_2 - Optimal Solvent Design, 253
Case Study CAMD_3 - Optimal Solvent Design, 254
Case Study in Identification of Multistep Reaction Stoichiometries, 319
Case Study in Optimal Solvent Design, 247
Case Study Objective - Design of Aqueous Blanket Wash Blends, 278
Case Study: Production of 1-Naphthalenyl Methyl Carbamate, 198
Challenge Problem - CAMD Industrial Example, 226
Challenges, 357
Challenges and Opportunities for CAMD, 357
Challenges for the Early Evaluation Tools, 230
Chemical Feasibility Rules, 174
Chemistry Constraints, 182
Chemistry Constraints - Carbaryl Example, 323
Choice of First order Groups - Refrigerant Design, 293

Churi-Achenie Octet Rule Model, 251
Classification of Groups, 25
Co-Material Design, 172
Co-Material Design (Results) - Carbaryl Example, 324
Co-Material Design Procedure, 174
Combination & Feasibility Rules, 27
Construction of Estimators - Optimal Solvent Design, 257
Creation of Atomic Based Adjacency Description, 150
Current Trends Towards Problem Solutions, 368
Decision Tree Property Model Selection, 232
Definition of Structural Variables, 74
Description of Group Contribution Method, 66
Design for Maximum Solubility - Design of Fuel Additives, 348
Design for Minimum IVD - Design of Fuel Additives, 349
Design of an Aromatic Compound, 84
Design Phase, 139
Design-Relevant Building Blocks, 330
Desirable Properties, 133
DICOPT (Discrete and Continuous OPTimizer), 299
Drug Design, 365
EH&S and Special Properties, 134
Enabling Technologies, 371
Essential Properties, 133
Evolutionary Design of Fuel Additives, 348
Extension of Hybrid CAMD Method to Complex Molecules, 168
Extraction Solvent Replacement - CAMD Industrial Example, 215
Extractive Distillation, 37
Facets of Solvent-Based Processing Routes that Need Consideration, 232
Feasibility Criteria for the Synthesis of Linear Branched Structures, 30
Final Candidate Selection, 157
First Order Groups and their Bonds, 69
Fitness Function - Polymer Design, 113
Flexible Solution Strategies, 368
Flowchart of the Global Optimization Algorithm, 273
Forbidden Bond and Other Specific Constraints - Refrigerant Design, 295
Formulations, 368
Forward Problem (results) - Design of Fuel Additives, 345
Forward Problem (solution strategy) - Design of Fuel Additives, 336
From Molecule to Materials, 358
GA - Background, 97
GA - Building Block Hypothesis, 110

GA - Fitness Function, 100
GA - Forma Theory, 109
GA - Genetic Encoding, 98
GA - Implementation, 98
GA - Replacement Policy, 103
GA - Schema Theory, 104
GA - Selection of Parents, 100
GA - The Polymer Design Problem, 110
GA Based Search - Polymer Design, 307
GA Parameters - Polymer Design, 308
GAMS Interface, 299
General Problem Formulation - Optimization Methods in CAMD, 65
Generalized CAMD Framework, 214
Generation Algorithm for Level 1 - Hybrid CAMD Method, 141
Generation Algorithm for Level 2 - Hybrid CAMD Method, 147
Generation Algorithm for Level 3 - Hybrid CAMD Method, 150
Generation Algorithm for Level 4 - Hybrid CAMD Method, 154
Generation Level, 18
Generation of 3D Structures, 153
Generation of Feasible Molecular Structures, 27
Generation of Group Vectors from 1st-Order Groups, 141
Generation of Structural Isomers from Group Vectors, 145
Genetic Algorithms & Genetic Programming, 97
Genetic Algorithms Based CAMD, 95
Genetic Search Results - Polymer Design, 310
Global Optimization Methods Based on Interval-Analysis, 268
Hybrid CAMD Method, 129
Hybrid Generate & Test CAMD Algorithm, 140
Hybrid Modelling Approach, 359
Identification of Environmentally Benign Stoichiometries, 171
Identification of Forbidden Bonds Between Groups, 82
Identification of Multistep Reaction Stoichiometries, 167
Incorporation of High-Level Knowledge: Molecular Stability, 306
Insertion and Deletion - Polymer Design, 116
Integration of Process-Product Design, 157
Interval Analysis - Brief Introduction, 266
Inverse Problem (solution strategy) - Design of Fuel Additives, 341
Knowledge Base - Hybrid CAMD Method, 132
Knowledge Extraction - From Rules to Features, 362
LIBRA, 271

Linear Estimators and Branching Functions - B&B Method, 50
Liquid Extraction, 35
Lithographic Blanket Washes, 263
Lower Bound Algorithm - B&B Method, 48
Main-chain Mutation and Side-chain Mutation - Polymer Design, 115
Mass Separating Agent - CAMD Industrial Example, 222
Method & Constraint Selection - Hybrid CAMD Method, 132
Methods & Tools - Optimization in CAMD, 55
Mixture Design Problem Formulation, 264
Mixture Properties, 12
Mixture Property Models - Design of Aqueous Blanket Wash Blends, 285
Molecular Complexity - Polymer Design, 306
Molecular Design - Generation & Test Methods, 23
Molecular Design of Fuel Additives, 329
Molecular Encoding Technique, 164
Molecular Representation, 67
Molecular Structure Representation, 15
Molecular Synthesis, 24
Molecule Representation - Polymer Design, 112
Multidisciplinary Approach, 374
Multi-Step Stoichiometry Identification Results, 201
Multi-Step Stoichiometry Identification Results - Carbaryl Example, 324
Multistep Stoichiometry Identification Algorithm, 194
Near-optimal Solutions - Polymer Design, 311
New Group Combination Property Characterization, 29
Nitric Acid Oxidation of Anthracene to Anthraquinone - CAMD Industrial Example, 236
Octet Rule, 177
Odele-Macchietto Octet Rule Model, 251
Optimization Methods in CAMD - I, 43
Optimization Methods in CAMD - II, 63
Parametric Sensitivity and Robustness Analyses for GAs, 312
Polymer Design Case Study, 303
Post-Design Phase, 156
Post-Design Phase - Extraction Solvent Replacement, 221
Post-Design Phase - Mass Separating Agent, 225
Pre-Design Phase, 130
Pre-Design Phase - Extraction Solvent Replacement, 217
Pre-Design Phase - Mass Separating Agent, 224
Prediction of Properties, 9
Primary Pure Component Properties, 10
Problem Definition - Design of Fuel Additives, 332
Problem Definition - Design of Optimal Solvent (Case Study), 248

Problem Definition - Optimization Methods in CAMD, 43
Problem Formulation Algorithm - Hybrid CAMD Method, 137
Problem Formulation - Reaction Stoichiometries Case Study, 319
Problem Solution - Refrigerant Design, 299
Problem Type and Solution - Optimization Methods in CAMD, 81
ProCAMD & Chem3D, 168
Product Design for Performance, 329
Properties Handled in Level-1, 144
Property Constraints, 6
Property Constraints and Objective Function - Refrigerant Design, 290
Property Level, 18
Property Prediction in Level 3, 151
Property Range, 18
Property Trust, 19
Proposed GA Framework - Polymer Design, 111
Reactor Process Model Equations, 188
Reducing the Combinatorial Size of the Problem, 32
Refrigerant Design Case Study, 289
Results & Discussion - Design of Aqueous Blanket Wash Blends, 283
Results and Discussion - Design of Fuel Additives, 344
Reverse Problem Formulations for Integrated Process-Product Design, 370
Role Specification Constraints, 182
Role Specification Constraints - Carbaryl Example, 322
Role Specification Constraints - Reaction Stoichiometries Case Study, 320
Schematic of Lithographic Printing, 247
Secondary Pure Component Properties, 12
Selection of Branching Function - B&B Method, 54
Selection of Reaction Schemes, 363
Single Step Stoichiometry Enumeration, 193
Single-point Crossover - Polymer Design, 114
SMSWIN, 231
Solvent for Dehydration - CAMD Industrial Example, 242
Solvent for Ethanol Recovery, 38
Solvent for Separation of n-Propyl Acetate from n-Propyl Alcohol, 39
Solvent Mixture Design, 261
Solvent Selection Criteria - Nitric Acid Oxidation Example, 236
Solvent Selection in Industry - I, 213
Solvent Selection in Industry - II, 229
Solvent Selection Methodology in ICAS, 235

Solvent Selection Methodology in SMSWIN, 234
Solvent Selection using ProCAMD, 238
Solvent Selection using SMSWIN - Nitric Acid Oxidation Example, 237
Solving the Multistep Stoichiometry Identification Problem, 192
Special Features for Complex Solutes - CAMD Framework, 214
SQP, 272
Step by Step Algorithm for Solution Technique - B&B Method, 54
Step by Step Algorithm for the Solution Technique, 271
Stoichiometry Identification Formulation, 179
Structural Feasibility Constraints, 76
Structure-Property Relationships - Refrigerant Design, 291
Summary of Problem Formulation - Refrigerant Design, 298
System Gibbs Free Energy, 189
Target Molecule Identification, 196
Target Polymers and their Properties - Polymer Design, 305
Test or Molecule Evaluation Stage, 34
The Algebra of Genetic Algorithms, 104
The Blending Operator - Polymer Design, 117
The Evolution of CAMD, 24
The Hop-mutation Operator - Polymer Design, 117
Thermodynamic and Environmental Property Equations, 187
Upper Bound Algorithm - B&B Method, 49
VerGO, 271
What is CAMD?, 3
Whole Number Stoichiometries Constraints, 180

Author Index

L. E. K. Achenie, 3, 43, 247, 261, 357
C. S. Adjiman, 63, 289
A. Apostolakou, 63, 289
E. A. Brignole, 23
A. Buxton, 167, 319
J. M. Caruthers, 329
M. Cismondi, 23
J. L. Cordiner, 229
R. Gani, 3, 129, 357
P. M. Harper, 129
M. Hostrup, 129
A. Hugo, 167, 319
A. G. Livingston, 167, 319
G. M. Ostrovski, 43, 247
P. Patkar, 95, 303
E. N. Pistikopoulos, 167, 319
M. Sinha, 43, 247, 261
A. Sundaram, 329
V. Venkatasubramanian, 3, 95, 303, 329, 357
J. M. Vinson, 211
