Chapter 10
Nonlinear Programming Methods
Background
• Solution techniques for nonlinear programming (NLP) are much more complex and much less effective than those for linear programming (LP).
• Linear programming codes will provide optimal solutions for problems with hundreds of thousands of variables, but there is a reasonable chance that an NLP code will fail on a problem containing only a handful of variables.
• To sharpen this contrast, recall that all interior point methods for solving LP problems include ideas originally developed to solve NLP problems.
10.1 CLASSICAL OPTIMIZATION

The simplest situation that we address concerns the minimization of a function f in the absence of any constraints. This problem can be written as

Minimize {f(x): x ∈ Rⁿ}     (1)

where f ∈ C² (twice continuously differentiable). Without additional assumptions on the nature of f, we will most likely have to be content with finding a point that is a local minimum. Elementary calculus provides a necessary condition that must be true for an optimal solution of a nonlinear function with continuous first and second derivatives: the gradient is zero at every stationary point that is a candidate for a maximum or minimum.
Sufficient conditions derived from convexity properties are also available in many cases.
Unconstrained Optimization

The first-order necessary condition that any point x* must satisfy to be a minimum of f is that the gradient must vanish:

∇f(x*) = 0     (2)

This property is most easily illustrated for a univariate objective function, in which case the gradient is simply the derivative, or slope, of f(x).
Consider, for example, Figure 10.1. The function in part (a) has a unique global minimum x* at which the slope is zero. Any movement from that point yields a greater, and therefore less favorable, value. The graph in part (b) exhibits a range of contiguous global minima where the necessary condition holds; however, we should note that the corresponding f(x) is not twice continuously differentiable at all points.
Figure 10.2 shows why Equation (2) is only a necessary condition and not a sufficient condition. In all three parts of the figure there are points at which the slope of f(x) is zero but the global minima are not attained. Figure 10.2a illustrates a strong local maximum at x1* and a strong local minimum at x2*. Figure 10.2b shows a point of inflection at x1* that is a one-dimensional saddle point. Finally, Figure 10.2c presents the case of a unique global maximum at x1*.
The ideas embodied in Figures 10.1 and 10.2 can be easily generalized to functions in a higher-dimensional space at both the conceptual and mathematical levels. Note, however, that the necessary condition that the gradient be zero ensures only a stationary point, i.e., a local minimum, a local maximum, or a saddle point at x*.
Sufficient conditions for x* to be either a local or a global minimum:
• If f(x) is strictly convex in the neighborhood of x*, then x* is a strong local minimum.
• If f(x) is convex for all x, then x* is a global minimum.
• If f(x) is strictly convex for all x, then x* is the unique global minimum.
To be precise, a neighborhood of x is an open sphere centered at x with arbitrarily small radius ε > 0. It is denoted by Nε(x), where Nε(x) = {y : ∥y − x∥ < ε}.
f(x) is strictly convex if its Hessian matrix H(x) is positive definite for all x. In this case, a stationary point must be a unique global minimum.
f(x) is convex if its Hessian matrix H(x) is positive semidefinite for all x. For this case a stationary point will be a global (but perhaps not unique) minimum.
If we do not know the Hessian for all x, but we evaluate H(x*) at a stationary point x* and find it to be positive definite, the stationary point is a strong local minimum.
(If H(x*) is only positive semidefinite at x*, x* cannot be guaranteed to be a local minimum.)
Functions of a Single Variable
Let f(x) be a convex function of x.
A necessary and sufficient condition for x* to be a global minimum is that the first derivative of f(x) be zero at that point. This is also a necessary and sufficient condition for the maximum of a concave function. The optimal solution is determined by setting the derivative equal to zero and solving the corresponding equation for x. If no solution exists, there is no finite optimal solution.
A sufficient condition for a local minimum (maximum) point of an arbitrary function is that the first derivative of the function be zero and the second derivative be positive (negative) at the point.
Example 1

Let us find the minimum of f(x) = 4x² − 20x + 10. The first step is to take the derivative of f(x) and set it equal to zero:

df(x)/dx = 8x − 20 = 0

Solving this equation yields x* = 2.5, which is a candidate solution. Looking at the second derivative, we see

d²f(x)/dx² = 8 > 0 for all x

so f is strictly convex. Therefore, x* is a global minimum.
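The arithmetic in Example 1 is easy to confirm with a few lines of code. The sketch below (plain Python, no optimization library) checks the stationary point and the convexity argument:

```python
def f(x):
    # Objective from Example 1
    return 4 * x**2 - 20 * x + 10

def df(x):
    # First derivative: df(x)/dx = 8x - 20
    return 8 * x - 20

# Stationary point from setting the derivative to zero: 8x - 20 = 0
x_star = 20 / 8
assert df(x_star) == 0

# The second derivative is the constant 8 > 0, so f is strictly convex;
# every nearby point must therefore be worse than x*.
for step in (-0.1, 0.1):
    assert f(x_star + step) > f(x_star)
print(x_star, f(x_star))   # → 2.5 -15.0
```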
Example 2

As a variation of Example 1, let us find the minimum of f(x) = −4x² − 20x.
Taking the first derivative and setting it equal to zero yields df(x)/dx = −8x − 20 = 0, so x* = −2.5. The second derivative is d²f(x)/dx² = −8 < 0 for all x, so f is strictly concave.
This means that x* is a global maximum. There is no minimum solution because f(x) is unbounded from below.
Example 3

Now let us minimize the cubic function f(x) = 8x³ + 15x² + 9x + 6. Taking the first derivative and setting it equal to zero yields df(x)/dx = 24x² + 30x + 9 = (6x + 3)(4x + 3) = 0. The roots of this quadratic are at x = −0.5 and x = −0.75, so we have two candidates. Checking the second derivative

d²f(x)/dx² = 48x + 30

we see that it can be > 0 or < 0. Therefore, f(x) is neither convex nor concave. At x = −0.5, d²f(−0.5)/dx² = 6, so we have a local minimum. At x = −0.75, d²f(−0.75)/dx² = −6, which indicates a local maximum.
These points are not global optima, because the function is actually unbounded from both above and below.
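The second-derivative test applied by hand in Example 3 can be wrapped in a short classification loop; the two roots are taken from the factorization above:

```python
def df(x):
    # f(x) = 8x^3 + 15x^2 + 9x + 6, so f'(x) = 24x^2 + 30x + 9
    return 24 * x**2 + 30 * x + 9

def d2f(x):
    # f''(x) = 48x + 30
    return 48 * x + 30

# Roots of f'(x) = (6x + 3)(4x + 3) = 0
for root in (-0.5, -0.75):
    assert df(root) == 0                  # stationary point (exact in floats here)
    kind = "local minimum" if d2f(root) > 0 else "local maximum"
    print(root, d2f(root), kind)
# → -0.5 6.0 local minimum
# → -0.75 -6.0 local maximum
```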
Functions of Several Variables
Theorem 1:
Let f(x) be twice continuously differentiable throughout a neighborhood of x*. Necessary conditions for x* to be a local minimum of f are
a. ▽f(x*)=0
b. H(x*) is positive semidefinite.
Theorem 2:
Let f(x) be twice continuously differentiable throughout a neighborhood of x*. Then a sufficient condition for f(x) to have a strong local minimum at x*, where Equation (2) holds, is that H(x*) be positive definite.
Note:
H(x*) being positive semidefinite is not a sufficient condition for f(x) to have a local minimum at x*.
Quadratic Forms

A common and useful nonlinear function is the quadratic function

f(x) = a + cx + (1/2)xᵀQx

with coefficients a ∈ R, the n-dimensional row vector c, and the symmetric n × n matrix Q. Q is the Hessian matrix of f(x). Setting the gradient

∇f(x) = cᵀ + Qx

to zero results in a set of n linear equations in n variables. A solution will exist whenever Q is nonsingular. In such instances, the stationary point is

x* = −Q⁻¹cᵀ
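The formula x* = −Q⁻¹cᵀ can be checked numerically. The sketch below uses the numbers that reappear in Example 4 (Q equal to that example's Hessian and c = (−20, 4), matching its gradient equations); in practice one solves the linear system rather than inverting Q:

```python
import numpy as np

# Data for f(x) = a + cx + (1/2) x^T Q x (taken from Example 4's derivatives)
Q = np.array([[50.0, 0.0],
              [0.0, 8.0]])        # symmetric, nonsingular Hessian
c = np.array([-20.0, 4.0])        # row vector c

# Stationary point x* = -Q^{-1} c^T; solve Qx = -c^T instead of inverting
x_star = np.linalg.solve(Q, -c)
print(x_star)                     # → [ 0.4 -0.5]

# Verify the gradient c^T + Qx vanishes at x*
assert np.allclose(c + Q @ x_star, 0.0)
```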
For a two-dimensional problem, the quadratic function is

f(x) = a + c1x1 + c2x2 + (1/2)q11x1² + q12x1x2 + (1/2)q22x2²

For this function, setting the partial derivatives with respect to x1 and x2 equal to zero results in the following linear system.

c1 + q11x1 + q12x2 = 0
c2 + q12x1 + q22x2 = 0

These equations can be solved using Cramer's rule from linear algebra. The first step is to find the determinant of the Q matrix. Let

det Q = det [ q11  q12
              q12  q22 ] = q11q22 − (q12)²

The appropriate substitutions yield

x1* = (c2q12 − c1q22)/det Q  and  x2* = (c1q12 − c2q11)/det Q

which is the desired stationary point.
When the objective function is a quadratic, the determination of definiteness is greatly facilitated because the Hessian matrix is constant.
For more general forms, it may not be possible to determine conclusively whether the function is positive definite, negative definite, or indefinite. In such cases, we can only make statements about local optimality.
In the following examples, we use H to identify the Hessian.
For quadratic functions, Q and H are the same.
Example 4

Find the local extreme values of

f(x) = 25x1² − 20x1 + 4x2² + 4x2 + 5

Solution: Using Equation (2) yields

50x1 − 20 = 0 and 8x2 + 4 = 0

The corresponding stationary point is x* = (2/5, −1/2). Because f(x) is a quadratic, its Hessian matrix is constant:

H = [ 50  0
       0  8 ]

The determinants of the leading submatrices of H are H1 = 50 and H2 = 400, so f(x) is strictly convex, implying that x* is the global minimum.
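The leading-principal-minor test used in Example 4 (Sylvester's criterion) is easy to automate; a sketch:

```python
import numpy as np

def leading_minors(H):
    """Determinants of the leading principal submatrices of H."""
    n = H.shape[0]
    return [float(np.linalg.det(H[:k, :k])) for k in range(1, n + 1)]

H = np.array([[50.0, 0.0],
              [0.0, 8.0]])    # Hessian from Example 4

minors = leading_minors(H)
print(minors)                  # H1 = 50, H2 = 400

# All leading minors positive => H positive definite => f strictly convex
assert all(m > 0 for m in minors)
```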
Example 5

Find the local extreme values of the nonquadratic function

f(x) = 3x1³ − 9x1 + x2² + 4x2

Solution: Using Equation (2) yields

∇f(x) = (9x1² − 9, 2x2 + 4)ᵀ = (0, 0)ᵀ

so x1 = ±1 and x2 = −2. Checking x = (1, −2), we have

H(1, −2) = [ 18  0
              0  2 ]
which is positive definite since vᵀH(1, −2)v = 18v1² + 2v2² > 0 when v ≠ 0. Thus (1, −2) yields a strong local minimum.
Next, consider x = (−1, −2) with Hessian matrix

H(−1, −2) = [ −18  0
                0  2 ]

Now we have vᵀH(−1, −2)v = −18v1² + 2v2², which may be less than or equal to 0 when v ≠ 0. Thus, the sufficient condition for (−1, −2) to be either a local minimum or a local maximum is not satisfied. Actually, the second necessary condition (b) in Theorem 1 for either a local minimum or a local maximum is not satisfied. Therefore, x = (1, −2) yields the only local extreme value of f.
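Rather than reasoning about vᵀHv by hand as in Example 5, one can inspect the eigenvalues of the two Hessians: all positive means positive definite, mixed signs means indefinite. A sketch:

```python
import numpy as np

H_min = np.array([[18.0, 0.0], [0.0, 2.0]])    # H(1, -2)
H_ind = np.array([[-18.0, 0.0], [0.0, 2.0]])   # H(-1, -2)

eig_min = np.linalg.eigvalsh(H_min)   # eigenvalues of a symmetric matrix, ascending
eig_ind = np.linalg.eigvalsh(H_ind)
print(eig_min, eig_ind)               # eigenvalues [2, 18] and [-18, 2]

assert np.all(eig_min > 0)            # positive definite: strong local minimum
assert eig_ind[0] < 0 < eig_ind[-1]   # indefinite: neither min nor max
```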
Example 6

Find the extreme values of f(x) = −2x1² + 4x1x2 − 4x2² + 4x1 + 4x2 + 10.

Solution: Setting the partial derivatives equal to zero leads to the linear system

−4x1 + 4x2 + 4 = 0 and 4x1 − 8x2 + 4 = 0

which yields x* = (3, 2). The Hessian matrix is

H = [ −4   4
       4  −8 ]

Evaluating the leading principal determinants of H, we find H1 = −4 and H2 = 16. Thus, f(x) is strictly concave and x* is a global maximum.
Nonquadratic Forms

When the objective function is not quadratic (or linear), the Hessian matrix will depend on the values of the decision variables x. Suppose

f(x) = (x2 − x1²)² + (1 − x1)²

The gradient of this function is

∇f(x) = (4x1(x1² − x2) − 2(1 − x1), 2(x2 − x1²))ᵀ

For the second component of the gradient to be zero, we must have x2 = x1². Taking this into account, the first component is zero only when x1 = 1, so x* = (1, 1) is the sole stationary point.
It was previously shown (in Section 9.3) that the Hessian matrix H(x) at this point is positive definite, indicating that it is a local minimum. Because we have not shown that the function is everywhere convex, further arguments are necessary to characterize the point as a global minimum.
Logically, f(x) ≥ 0 because each of its two component terms is squared. The fact that f(1,1) = 0 implies that (1,1) is a global minimum.
As a further example, consider

f(x) = (x1 − 2x2²)(x1 − 3x2²) = x1² − 5x1x2² + 6x2⁴

where

∇f(x) = (2x1 − 5x2², −10x1x2 + 24x2³)ᵀ and H(x) = [     2     −10x2
                                                     −10x2  −10x1 + 72x2² ]

A stationary point exists at x* = (0, 0). Also, H1 = 2 and H2 = 44x2² − 20x1, implying that H(x) is indefinite. Although H(x) is positive semidefinite at (0, 0), this does not allow us to conclude that x* is a local minimum. Notice that f(x) can be made arbitrarily small or large with the appropriate choices of x.
These last two examples suggest that for nonquadratic functions of several variables, the determination of the character of a stationary point can be difficult even when the Hessian matrix is semidefinite. Indeed, a much more complex mathematical theory is required for the general case.
Summary for Unconstrained Optimization

Table 10.1 summarizes the relationship between the optimality of a stationary point x* and the character of the Hessian evaluated at x*. It is assumed that f(x) is twice differentiable and ∇f(x*) = 0.
If H(x) exhibits either of the first two definiteness properties for all x, then "local" can be replaced with "global" in the associated characterizations. Furthermore, if f(x) is quadratic, a positive semidefinite Hessian matrix implies a nonunique global minimum at x*.
Notice that although convexity in the neighborhood of x* is sufficient to conclude that x* is a weak local minimum, the fact that H(x*) is positive semidefinite is not sufficient, in general, to conclude that f(x) is convex in the neighborhood of x*.
When H(x*) is positive semidefinite, it is possible that points in a small neighborhood of x* exist such that f(x) evaluated at those points produces smaller values than f(x*). This would invalidate the conclusion of convexity in the neighborhood of x*.
As a final example in this section, consider a third-degree polynomial f(x) in the three variables x1, x2, and x3 that contains several mixed terms, so that every entry of the gradient ∇f(x) and of the Hessian matrix H(x) is itself a polynomial in the components of x.
Looking at such a Hessian matrix, it is virtually impossible to make any statements about the convexity of f(x). This gives us a glimpse of the difficulties that can arise when one attempts to solve unconstrained nonlinear optimization problems by directly applying the classical theory. In fact, the real value of the theory is that it offers insights into the development of more practical solution approaches. Moreover, once we have a stationary point x* obtained from one of those approaches, it is relatively easy to check the properties of H(x*), because only numerical evaluations are required.
A Taylor series is a series expansion of a function about a point. The one-dimensional Taylor series of a real function f(x) about the point x = a is given by

f(x) = f(a) + f′(a)(x − a) + [f″(a)/2!](x − a)² + [f‴(a)/3!](x − a)³ + ...

If a = 0, the expansion is known as a Maclaurin series.
The Taylor expansion of f(x) at x0 is

f(x) = f(x0) + f′(x0)(x − x0) + [f″(x0)/2!](x − x0)² + [f‴(x0)/3!](x − x0)³ + [f⁽⁴⁾(x0)/4!](x − x0)⁴ + ...

Note:

f(x) = Pn(x) + Rn(x) for x, x0 ∈ [a, b]

where the nth-order Taylor polynomial is

Pn(x) = Σ(k=0 to n) [f⁽ᵏ⁾(x0)/k!](x − x0)ᵏ,  with f⁽ᵏ⁾(x0) = (dᵏf/dxᵏ)|x=x0

and the remainder term is

Rn(x) = [f⁽ⁿ⁺¹⁾(ξ)/(n + 1)!](x − x0)ⁿ⁺¹

for some ξ between x0 and x.
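A quick check of the expansion for a familiar function; here f(x) = eˣ about x0 = 0, where every derivative equals 1 (a standard fact, not taken from the text):

```python
import math

def taylor_exp(x, n):
    """Partial sum P_n(x) of the Maclaurin series of e^x."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 0.5
for n in (1, 2, 4, 8):
    approx = taylor_exp(x, n)
    print(n, approx, abs(math.exp(x) - approx))
# The remainder R_n shrinks like |x - x0|^(n+1) / (n+1)!,
# so each additional term cuts the error sharply.
assert abs(math.exp(x) - taylor_exp(x, 8)) < 1e-7
```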
Nonnegative Variables

A simple extension of the unconstrained optimization problem involves the addition of nonnegativity restrictions on the variables:

Minimize {f(x): x ≥ 0}     (3)

Suppose that f has a local minimum at x*, where x* ≥ 0. Then there exists a neighborhood Nε(x*) of x* such that whenever x ∈ Nε(x*) and x ≥ 0, we have f(x) ≥ f(x*). Now write x = x* + td, where d is a direction vector and t > 0. Assuming that f is twice continuously differentiable throughout Nε(x*), a second-order Taylor series expansion of f(x* + td) around x* yields

f(x) = f(x* + td) = f(x*) + t∇f(x*)d + (t²/2) dᵀH(x* + αtd)d

where α ∈ [0, 1]. Canceling terms and dividing through by t yields

0 ≤ ∇f(x*)d + (t/2) dᵀH(x* + αtd)d

As t → 0, the inequality becomes 0 ≤ ∇f(x*)d, which says that f must be nondecreasing in any feasible direction d. Hence, if x* > 0, we know that ∇f(x*) = 0. With a bit more analysis, it can be shown that the following conditions are necessary for x* to be a local minimum of f(x).
These results are summarized as follows.

Theorem 3: Necessary conditions for a local minimum of f in Problem (3) to occur at x* include

∇f(x*) ≥ 0,  ∇f(x*)x* = 0,  x* ≥ 0     (4)

where f is twice continuously differentiable throughout a neighborhood of x*. Componentwise, these conditions say

∂f(x*)/∂xj = 0 if xj* > 0
∂f(x*)/∂xj ≥ 0 if xj* = 0
Example 8

Minimize f(x) = 3x1² + x2² + x3² − 2x1x2 − 2x1x3 − 2x1 + 2
subject to x1 ≥ 0, x2 ≥ 0, x3 ≥ 0

Solution: From Conditions (4), we have the following necessary conditions for a local minimum.

a. 0 ≤ ∂f/∂x1 = 6x1 − 2x2 − 2x3 − 2
b. 0 = x1 ∂f/∂x1 = x1(6x1 − 2x2 − 2x3 − 2)
c. 0 ≤ ∂f/∂x2 = 2x2 − 2x1
d. 0 = x2 ∂f/∂x2 = x2(2x2 − 2x1)
e. 0 ≤ ∂f/∂x3 = 2x3 − 2x1
f. 0 = x3 ∂f/∂x3 = x3(2x3 − 2x1)
g. x1 ≥ 0, x2 ≥ 0, x3 ≥ 0
From condition (d), we see that either x2 = 0 or x1 = x2. When x2 = 0, conditions (c) and (g) imply that x1 = 0. From condition (f), then, x3 = 0. But this contradicts condition (a), so x2 ≠ 0 and x1 = x2.
Condition (f) implies that either x3 = 0 or x1 = x3. If x3 = 0, then conditions (d), (e), and (g) imply that x1 = x2 = x3 = 0. But this situation has been ruled out. Thus, x1 = x2 = x3, and from condition (b) we get x1 = 0 or x1 = 1. Since x1 ≠ 0, the only possible relative minimum of f occurs when x1 = x2 = x3 = 1. To characterize the solution at x* = (1, 1, 1), we evaluate the Hessian matrix

H = [  6  −2  −2
      −2   2   0
      −2   0   2 ]

which is easily shown to be positive definite. Thus, f is strictly convex and has a strong local minimum at x*. It follows from Theorem 2 in Chapter 9 that f(x*) = 1 is a global minimum.
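Assuming the objective reconstructs as f(x) = 3x1² + x2² + x3² − 2x1x2 − 2x1x3 − 2x1 + 2, which reproduces the partial derivatives used in conditions (a) through (f), the candidate x* = (1, 1, 1) can be checked numerically:

```python
import numpy as np

def grad(x):
    # Gradient of the (assumed) objective from Example 8
    x1, x2, x3 = x
    return np.array([6*x1 - 2*x2 - 2*x3 - 2,
                     2*x2 - 2*x1,
                     2*x3 - 2*x1])

H = np.array([[6.0, -2.0, -2.0],
              [-2.0, 2.0, 0.0],
              [-2.0, 0.0, 2.0]])

x_star = np.array([1.0, 1.0, 1.0])
# Interior point: conditions (a)-(f) collapse to a vanishing gradient
assert np.allclose(grad(x_star), 0.0)
# Positive definiteness (all eigenvalues > 0) confirms a strict local minimum
print(np.linalg.eigvalsh(H))
assert np.all(np.linalg.eigvalsh(H) > 0)
```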
Necessary Conditions for Optimality

Equality constraints:

Minimize f(x)
subject to gi(x) = 0, i = 1,..., m     (5)

The objective and constraint functions are assumed to be at least twice continuously differentiable. Furthermore, each of the gi(x) subsumes the constant term bi.
To provide intuitive justification for the general results, consider the special case of Problem (5) with two decision variables and one constraint, i.e.,

Minimize f(x1, x2)
subject to g(x1, x2) = 0
To formulate the first-order necessary conditions, we construct the Lagrangian

L(x1, x2, λ) = f(x1, x2) + λg(x1, x2)

where λ is an unconstrained variable called the Lagrange multiplier. Our goal now is to minimize the unconstrained function L. As in Section 10.1, we construct the gradient of the Lagrangian with respect to its decision variables x1 and x2 and the multiplier λ. Setting the gradient equal to zero, we obtain

∂L/∂x1 = ∂f(x1, x2)/∂x1 + λ ∂g(x1, x2)/∂x1 = 0
∂L/∂x2 = ∂f(x1, x2)/∂x2 + λ ∂g(x1, x2)/∂x2 = 0     (6)
∂L/∂λ = g(x1, x2) = 0

which represents three equations in three unknowns. Using the first two equations to eliminate λ, we have

(∂f/∂x1)(∂g/∂x2) − (∂f/∂x2)(∂g/∂x1) = 0,  g(x1, x2) = 0

which yields a stationary point x* and λ* when solved. From Equation (6), we see that ∇f(x1, x2) and ∇g(x1, x2) are collinear at this solution, i.e., ∇f(x1, x2) = −λ∇g(x1, x2).
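As a sanity check on the eliminated system, take a toy problem that is not from the text: minimize f = x1² + x2² subject to g = x1 + x2 − 1 = 0. Both equations can then be solved in closed form:

```python
# Toy equality-constrained problem (illustrative, not the chapter's example):
#   minimize x1^2 + x2^2   subject to   x1 + x2 - 1 = 0
# The eliminated system  f_x1 * g_x2 - f_x2 * g_x1 = 0,  g = 0  becomes
#   2*x1 * 1 - 2*x2 * 1 = 0   and   x1 + x2 - 1 = 0
# giving x1 = x2 = 0.5.

x1 = x2 = 0.5
assert 2 * x1 * 1 - 2 * x2 * 1 == 0       # eliminated stationarity condition
assert x1 + x2 - 1 == 0                   # feasibility
lam = -2 * x1 / 1                         # from f_x1 + lambda * g_x1 = 0
print(x1, x2, lam)                        # → 0.5 0.5 -1.0
```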
It is a simple matter to extend these results to the general case. The Lagrangian is

L(x, λ) = f(x) + Σ(i=1 to m) λi gi(x)

where λ = (λ1,..., λm) is an m-dimensional row vector. Here, every constraint has an associated unconstrained multiplier λi. Setting the partial derivatives of the Lagrangian with respect to each decision variable and each multiplier equal to zero yields the following system of n + m equations. These equations represent the first-order necessary conditions for an optimum to exist at x*.

∂L/∂xj = ∂f(x)/∂xj + Σ(i=1 to m) λi ∂gi(x)/∂xj = 0, j = 1,..., n     (7a)
∂L/∂λi = gi(x) = 0, i = 1,..., m     (7b)
A solution to Equations (7a) and (7b) yields a stationary point (x*, λ*); however, an additional qualification must be placed on the constraints in Equation (7b) if these conditions are to be valid.
The most common qualification is that the gradients of the binding constraints are linearly independent at a solution.
Because Equations (7a) and (7b) are identical regardless of whether a minimum or maximum is sought, additional work is required to distinguish between the two.
Indeed, it may be that some selection of the decision variables and multipliers that satisfies these conditions determines a saddle point of f(x) rather than a minimum or maximum.
Example 10

Minimize f(x) = (x1 + x2)²
subject to −(x1 − 3)³ + x2² = 0

The Lagrangian is

L(x, λ) = (x1 + x2)² + λ[x2² − (x1 − 3)³]

Now, setting partial derivatives equal to zero gives three highly nonlinear equations in three unknowns:

∂L/∂x1 = 2(x1 + x2) − 3λ(x1 − 3)² = 0
∂L/∂x2 = 2(x1 + x2) + 2λx2 = 0
∂L/∂λ = x2² − (x1 − 3)³ = 0
The feasible region is illustrated in Figure 10.3.
Notice that the two parts of the constraint corresponding to the positive and negative values of x2 form a cusp. At the endpoint (3,0), the second derivatives are not continuous, foreshadowing trouble.
In fact, x = (3, 0) is the constrained global minimum, but on substitution of this point into the necessary conditions, we find that the first two equations are not satisfied.
Further analysis reveals that no values of x1 , x2, and λ will satisfy all three equations. (Constraint qualification is not satisfied.)
The difficulty is that the constraint surface is not smooth, implying that the second derivatives are not everywhere continuous. Depending on the objective function, when such a situation arises the first-order necessary conditions [Equations (7a) and (7b)] may not yield a stationary point.
INEQUALITY CONSTRAINTS

The most general NLP model that we investigate is

Minimize f(x)
subject to hi(x) = 0, i = 1,..., p
           gi(x) ≤ 0, i = 1,..., m     (8)

where an explicit distinction is now made between the equality and inequality constraints. In the model, all functions are assumed to be twice continuously differentiable, and any right-hand-side constants are subsumed in the corresponding functions hi(x) or gi(x). Problems with a maximization objective or ≥ constraints can easily be converted into the form of the above problem. Although it is possible and sometimes convenient to treat variable bounds explicitly, we assume that they are included as a subset of the m inequalities.
Karush-Kuhn-Tucker (KKT) Necessary Conditions
To derive first- and second-order optimality conditions for
this problem, it is necessary to suppose that the constraints satisfy certain regularity conditions or constraint qualifications, as mentioned previously.
The accompanying results are important from a theoretical point of view but less so for the purposes of designing algorithms. Consequently, we take a practical approach and simply generalize the methodology used in the developments associated with the equality constrained Problem (5).
In what follows, let h(x) = (h1(x),..., hp(x))ᵀ and g(x) = (g1(x),..., gm(x))ᵀ. For each equality constraint we define an unrestricted multiplier, λi, i = 1,..., p, and for each inequality constraint we define a nonnegative multiplier, μi, i = 1,..., m. Let x, λ, and μ be the corresponding vectors. This leads to the Lagrangian for Problem (8):

L(x, λ, μ) = f(x) + Σ(i=1 to p) λi hi(x) + Σ(i=1 to m) μi gi(x)

Definition 1: Let x* be a point satisfying the constraints h(x*) = 0, g(x*) ≤ 0, and let K be the set of indices k for which gk(x*) = 0. Then x* is said to be a regular point of these constraints if the gradient vectors ∇hj(x*) (1 ≤ j ≤ p) and ∇gk(x*) (k ∈ K) are linearly independent.
Theorem 4 (Karush-Kuhn-Tucker Necessary Conditions): Let x* be a local minimum for Problem (8) and suppose that x* is a regular point for the constraints. Then there exist a vector λ* and a vector μ* such that

∂f(x*)/∂xj + Σ(i=1 to p) λi* ∂hi(x*)/∂xj + Σ(i=1 to m) μi* ∂gi(x*)/∂xj = 0, j = 1,..., n     (9a)
hi(x*) = 0, i = 1,..., p     (9b)
gi(x*) ≤ 0, i = 1,..., m     (9c)
μi* gi(x*) = 0, i = 1,..., m     (9d)
μi* ≥ 0, i = 1,..., m     (9e)
Constraints (9a) to (9e) were derived in the early 1950s and are known as the Karush-Kuhn-Tucker (KKT) conditions in honor of their developers. They are first-order necessary conditions and postdate Lagrange's work on the equality constrained Problem (5) by 200 years.
The first set of equations [Constraint (9a)] is referred to as the stationary conditions and is equivalent to dual feasibility in linear programming.
Constraints (9b) and (9c) represent primal feasibility, and Constraint (9d) represents complementary slackness.
Nonnegativity of the "dual" variables appears explicitly in Constraint (9e).
In vector form, the system can be written as

∇f(x*) + λ*∇h(x*) + μ*∇g(x*) = 0
h(x*) = 0
g(x*) ≤ 0
μ*g(x*) = 0
μ* ≥ 0
For the linear program, the KKT conditions are necessary and sufficient for global optimality. This is a result of the convexity of the problem and suggests the following, more general result.

Theorem 5 (Karush-Kuhn-Tucker Sufficient Conditions): For Problem (8), let f(x) and gi(x) be convex, i = 1,..., m, and let hi(x) be linear, i = 1,..., p. Suppose that x* is a regular point for the constraints and that there exist a λ* and a μ* such that (x*, λ*, μ*) satisfies Constraints (9a) to (9e). Then x* is a global optimal solution to Problem (8). If the convexity assumptions on the objective and constraint functions are restricted to a neighborhood Nε(x*) for some ε > 0, then x* is a local minimum of Problem (8). (If we are maximizing f(x), f(x) must be concave.)
Sufficient Conditions

The foregoing discussion has shown that under certain convexity assumptions and a suitable constraint qualification, the first-order KKT conditions are necessary and sufficient for at least local optimality. Actually, the KKT conditions are sufficient to determine whether a particular solution is a global minimum if it can be shown that the solution (x*, λ*, μ*) is a saddle point of the Lagrangian function. (This is the other case in which the KKT conditions are sufficient.)

Definition 2: The triplet (x*, λ*, μ*) is called a saddle point of the Lagrangian function if μ* ≥ 0 and

L(x*, λ, μ) ≤ L(x*, λ*, μ*) ≤ L(x, λ*, μ*)

for all x and λ, and all μ ≥ 0.
Hence, x* minimizes L over x when (λ, μ) is fixed at (λ*, μ*), and (λ*, μ*) maximizes L over (λ, μ) with μ ≥ 0 when x is fixed at x*. This leads to the definition of the dual problem in nonlinear programming.

Lagrangian Dual: Maximize {Ψ(λ, μ): λ free, μ ≥ 0}     (10)

where

Ψ(λ, μ) = min over x of {f(x) + λh(x) + μg(x)}

When all the functions in Problem (8) are linear, Problem (10) reduces to the familiar LP dual. In general, Ψ(λ, μ) is a concave function; for the LP it is piecewise linear as well as concave.
Theorem 6 (Saddle Point Conditions for Global Minimum): A solution (x*, λ*, μ*) with μ* ≥ 0 is a saddle point of the Lagrangian function if and only if

a. x* minimizes L(x, λ*, μ*)
b. g(x*) ≤ 0, h(x*) = 0
c. μ*g(x*) = 0

Moreover, (x*, λ*, μ*) is a saddle point if and only if x* solves Problem (8) and (λ*, μ*) solves the dual Problem (10) with no duality gap, that is, f(x*) = Ψ(λ*, μ*).
Under the convexity assumptions in Theorem 5, the KKT conditions are sufficient for optimality. Under weaker assumptions, such as nondifferentiability of the objective function, however, they are not applicable. Table 10.2 summarizes the various cases that can arise and the conclusions that can be drawn from each.
Example 11

Use the KKT conditions to solve the following problem.

Minimize f(x) = 2(x1 + 1)² + 3(x2 − 4)²
subject to x1² + x2² ≤ 9
           x1 + x2 ≥ 2

Solution: It is straightforward to show that both the objective function and the feasible region are convex. Therefore, we are assured that a global minimum exists and that any point x* that satisfies the KKT conditions will be a global minimum. Figure 10.4 illustrates the constraint region and the isovalue contour f(x) = 2.
The partial derivatives required for the analysis are

∂f/∂x1 = 4(x1 + 1),  ∂g1/∂x1 = 2x1,  ∂g2/∂x1 = −1
∂f/∂x2 = 6(x2 − 4),  ∂g1/∂x2 = 2x2,  ∂g2/∂x2 = −1
Note that we have rewritten the second constraint as a ≤ constraint, g2(x) = 2 − x1 − x2 ≤ 0, prior to evaluating the partial derivatives. Based on this information, the KKT conditions are as follows.

a. 4(x1 + 1) + 2μ1x1 − μ2 = 0,  6(x2 − 4) + 2μ1x2 − μ2 = 0
b. x1² + x2² − 9 ≤ 0,  2 − x1 − x2 ≤ 0
c. μ1(x1² + x2² − 9) = 0,  μ2(2 − x1 − x2) = 0
d. μ1 ≥ 0,  μ2 ≥ 0
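Assuming the constraints reconstruct as x1² + x2² ≤ 9 and x1 + x2 ≥ 2, the KKT system can be solved by guessing that only the circle is binding (so μ2 = 0). Condition (a) then expresses x1 and x2 in terms of μ1 alone, and substituting into the binding circle leaves one equation that bisection can solve:

```python
# Guess mu2 = 0 (second constraint slack). Condition (a) then gives
#   x1 = -2/(2 + mu1)   and   x2 = 12/(3 + mu1);
# substitute into the binding circle x1^2 + x2^2 = 9 and bisect on mu1.
def residual(mu):
    x1 = -2.0 / (2.0 + mu)
    x2 = 12.0 / (3.0 + mu)
    return x1**2 + x2**2 - 9.0

lo, hi = 0.5, 2.0                       # residual changes sign on this bracket
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if residual(mid) > 0:
        lo = mid
    else:
        hi = mid
mu1 = 0.5 * (lo + hi)
x1, x2 = -2.0 / (2.0 + mu1), 12.0 / (3.0 + mu1)
print(round(x1, 3), round(x2, 3), round(mu1, 3))

# Check the remaining KKT conditions at the computed point
assert abs(4*(x1 + 1) + 2*mu1*x1) < 1e-9   # stationarity in x1
assert abs(6*(x2 - 4) + 2*mu1*x2) < 1e-9   # stationarity in x2
assert x1 + x2 >= 2                         # the slack constraint is feasible
```

Because the guess produces a point with μ1 > 0 on the binding circle and strictly feasible in the other constraint, the convexity argument above certifies it as the global minimum.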
Explicit Consideration of Nonnegativity Restrictions

Nonnegativity is often required of the decision variables. When this is the case, the first-order necessary conditions listed as Constraints (9a) to (9e) can be specialized in a way that gives a slightly different perspective. Omitting explicit treatment of the equality constraints, the problem is now

Minimize {f(x): gi(x) ≤ 0, i = 1,..., m; x ≥ 0}
The Karush-Kuhn-Tucker conditions for a local minimum are as follows.

∂f(x*)/∂xj + Σ(i=1 to m) μi* ∂gi(x*)/∂xj ≥ 0, j = 1,..., n     (11a)
gi(x*) ≤ 0, i = 1,..., m     (11b)
xj* [∂f(x*)/∂xj + Σ(i=1 to m) μi* ∂gi(x*)/∂xj] = 0, j = 1,..., n     (11c)
μi* gi(x*) = 0, i = 1,..., m     (11d)
xj* ≥ 0, j = 1,..., n;  μi* ≥ 0, i = 1,..., m     (11e)
Example 12

Find a point that satisfies the first-order necessary conditions for the following problem.

Minimize f(x) = x1² + 4x2² − 8x1 − 16x2 + 32
subject to x1 + x2 ≤ 5, x1 ≥ 0, x2 ≥ 0

Solution: We first write out the Lagrangian function, excluding the nonnegativity conditions:

L(x, μ) = x1² + 4x2² − 8x1 − 16x2 + 32 + μ(x1 + x2 − 5)

The specialized KKT conditions [Constraints (11a) to (11e)] are

a. 2x1 − 8 + μ ≥ 0,  8x2 − 16 + μ ≥ 0
b. x1 + x2 − 5 ≤ 0
c. x1(2x1 − 8 + μ) = 0,  x2(8x2 − 16 + μ) = 0
d. μ(x1 + x2 − 5) = 0
e. x1 ≥ 0, x2 ≥ 0, μ ≥ 0
Let us begin by examining the unconstrained optimal solution x = (4, 2). Because both primal variables are nonzero at this point, condition (c) requires that μ = 0. This solution satisfies all the conditions except (b), primal feasibility, suggesting that the inequality x1 + x2 ≤ 5 is binding at the optimal solution. Let us further suppose that x > 0 at the optimal solution. Condition (c) then requires 2x1 − 8 + μ = 0 and 8x2 − 16 + μ = 0.
Coupled with x1 + x2 = 5, we have three equations in three unknowns. Their solution is x = (3.2, 1.8) and μ = 1.6, which satisfies Constraints (11a) to (11e) and is a regular point. Given that the objective function is convex and the constraints are linear, these conditions are also sufficient. Therefore, x* = (3.2, 1.8) is the global minimum.
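The 3 × 3 linear system quoted in Example 12 can be solved directly:

```python
import numpy as np

# Unknowns (x1, x2, mu) from: 2x1 + mu = 8, 8x2 + mu = 16, x1 + x2 = 5
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 8.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([8.0, 16.0, 5.0])
x1, x2, mu = np.linalg.solve(A, b)
print(round(x1, 4), round(x2, 4), round(mu, 4))   # → 3.2 1.8 1.6

# KKT checks: feasibility, positivity of the primal variables,
# and a nonnegative multiplier on the binding constraint
assert abs(x1 + x2 - 5) < 1e-12 and mu > 0 and x1 > 0 and x2 > 0
```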
Summary

Necessary conditions for a local minimum:

• Unconstrained problems (Min f(x)):
  a. ∇f(x*) = 0
  b. H(x*) is positive semidefinite.

• Min f(x), s.t. x ≥ 0:
  a. ∇f(x*) ≥ 0
  b. ∇f(x*)x* = 0
  c. x* ≥ 0

• Min f(x), s.t. h(x) = 0:
  a. ∇f(x*) + λ∇h(x*) = 0
  b. h(x*) = 0

• Min f(x), s.t. h(x) = 0, g(x) ≤ 0:
  a. ∇f(x*) + λ*∇h(x*) + μ*∇g(x*) = 0
  b. h(x*) = 0
  c. g(x*) ≤ 0
  d. μ*g(x*) = 0
  e. μ* ≥ 0

• Min f(x), s.t. g(x) ≤ 0, x ≥ 0:
  a. ∇f(x*) + μ*∇g(x*) ≥ 0
  b. g(x*) ≤ 0
  c. x*{∇f(x*) + μ*∇g(x*)} = 0
  d. μ*g(x*) = 0
  e. μ* ≥ 0
  f. x* ≥ 0
10.4 SEPARABLE PROGRAMMING

Problem Statement

Consider the general NLP problem Minimize {f(x): gi(x) ≤ bi, i = 1,..., m} with two additional provisions: (1) the objective function and all constraints are separable, and (2) each decision variable xj is bounded below by 0 and above by a known constant uj, j = 1,..., n.
Recall that a function f(x) is separable if it can be expressed as the sum of functions of the individual decision variables:

f(x) = Σ(j=1 to n) fj(xj)
The separable NLP has the following structure.

Minimize f(x) = Σ(j=1 to n) fj(xj)
subject to Σ(j=1 to n) gij(xj) ≤ bi, i = 1,..., m
           0 ≤ xj ≤ uj, j = 1,..., n

The key advantage of this formulation is that the nonlinearities are mathematically independent. This property, in conjunction with the finite bounds on the decision variables, permits the development of a piecewise linear approximation for each function in the problem.
Consider the general nonlinear function f(x) depicted in Figure 10.5. To form a piecewise linear approximation using, say, r line segments, we must first select r + 1 values of the scalar x within its range 0 ≤ x ≤ u (call them x̄0, x̄1,..., x̄r) and let fk = f(x̄k) for k = 0, 1,..., r. At the boundaries we have x̄0 = 0 and x̄r = u. Notice that the grid points x̄0, x̄1,..., x̄r do not have to be evenly spaced.
Recall that any value of x lying between the two endpoints of the kth line segment may be expressed as

x = αx̄k+1 + (1 − α)x̄k,  or  x = x̄k + α(x̄k+1 − x̄k),  for 0 ≤ α ≤ 1

where the grid points x̄k (k = 0, 1,..., r) are data and α is the decision variable. This relationship leads directly to an expression for the kth line segment:

f̂(x) = fk + [(fk+1 − fk)/(x̄k+1 − x̄k)](x − x̄k) = αfk+1 + (1 − α)fk,  for 0 ≤ α ≤ 1
The approximation f̂(x) becomes increasingly more accurate as r gets larger. Unfortunately, there is a corresponding growth in the size of the resultant problem.
For the kth segment, let α = αk+1 and let (1 − α) = αk. Then for x̄k ≤ x ≤ x̄k+1, the expressions for x and f̂(x) become

x = αk x̄k + αk+1 x̄k+1  and  f̂(x) = αk fk + αk+1 fk+1

where αk + αk+1 = 1 and αk ≥ 0, αk+1 ≥ 0. Generalizing this procedure to cover the entire range over which x is defined yields

x = Σ(k=0 to r) αk x̄k,  f̂(x) = Σ(k=0 to r) αk fk,  Σ(k=0 to r) αk = 1,  αk ≥ 0, k = 0, 1,..., r
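The single-variable construction above can be sketched directly; the quadratic term and the grid below are illustrative choices, not the chapter's data:

```python
import bisect

def piecewise_approx(f, grid):
    """Return f_hat built from line segments through the points (x_k, f(x_k))."""
    fk = [f(x) for x in grid]
    def f_hat(x):
        # Locate the segment [grid[k], grid[k+1]] containing x
        k = min(bisect.bisect_right(grid, x), len(grid) - 1) - 1
        k = max(k, 0)
        alpha = (x - grid[k]) / (grid[k + 1] - grid[k])   # weight on point k+1
        return (1 - alpha) * fk[k] + alpha * fk[k + 1]
    return f_hat

f = lambda x: x**2                    # illustrative separable term
grid = [0.0, 0.5, 1.0, 1.5, 2.0]      # r = 4 segments; need not be evenly spaced
f_hat = piecewise_approx(f, grid)

assert f_hat(0.5) == 0.25             # exact at the grid points
assert abs(f_hat(0.75) - 0.5625) <= 0.07   # small interpolation error between them
print(f_hat(0.75))                     # → 0.625
```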
such that at least one and no more than two αk can be greater than zero. Furthermore, we require that if two αk are greater than zero, their indices must differ by exactly 1. In other words, if αs is greater than zero, then only one of either αs+1 or αs−1 can be greater than zero.
If this last condition, known as the adjacency criterion, is not satisfied, the approximation to f(x) will not lie on f̂(x).
The separable programming problem in x becomes the following "almost" linear program in α.
Minimize f̂(α) = Σ(j=1 to n) Σ(k=0 to rj) fj(x̄jk) αjk

subject to ĝi(α) = Σ(j=1 to n) Σ(k=0 to rj) gij(x̄jk) αjk ≤ bi, i = 1,..., m

Σ(k=0 to rj) αjk = 1, j = 1,..., n

αjk ≥ 0, j = 1,..., n; k = 0,..., rj
Example 13

Consider the following problem, whose feasible region is shown graphically in Figure 10.6. All the functions are convex, but the second constraint takes the form g2(x) ≥ 10. Because g2(x) is not linear, this implies that the feasible region is not convex, and so the solution to the approximate problem may not be a global optimal solution.

Minimize f(x) = x1² − 2x1 + 3x2²
subject to g1(x) = 3x1² + 4x2² ≤ 8
           g2(x) = 3(x1 − 2)² + 5(x2 − 2)² ≥ 10
           g3(x) = 3(x1 − 2)² + 5(x2 + 2)² ≤ 21
           0 ≤ x1 ≤ 1.75, 0 ≤ x2 ≤ 1.5
76
77
The upper bounds on the variables have been selected to be redundant. The objective function and constraints are separable, with the individual terms being identified in Table 10.3.
To develop the piecewise linear approximations, we select six grid points for each variable and evaluate the functions at each point. The results are given in Table 10.4. For this example, n = 2, m = 3, r_1 = 5, and r_2 = 5. As an illustration, the piecewise linear approximations of f_1(x) and g_12(x), along with the original graphs, are depicted in Figure 10.7. The full model has five constraints and 12 variables. The coefficient matrix is given in Table 10.5, where the last two rows correspond to the summation constraints on the two sets of α variables.
78
The problem will be solved with a linear programming code modified to enforce the adjacency criterion. In particular, for the jth variable we do not allow an α_jk variable to enter the basis unless α_j,k−1 or α_j,k+1 is already in the basis, or no α_jk (k = 0, 1, ..., 5) is currently basic. The following slack and artificial variables are used to put the problem into standard simplex form.
• s1 = slack for constraint 1, g_1
• s2 = surplus for constraint 2, g_2
• a2 = artificial for constraint 2, g_2
• s3 = slack for constraint 3, g_3
• a4 = artificial for constraint 4, the summation constraint Σ_k α_1k = 1
• a5 = artificial for constraint 5, the summation constraint Σ_k α_2k = 1
The initial basic solution is
x_B = (s1, a2, s3, a4, a5) = (8, 10, 21, 1, 1)
79
80
81
QUADRATIC PROGRAMMING
A linearly constrained optimization problem with a quadratic objective function is called a quadratic program (QP). Because of its many applications, quadratic programming is often viewed as a discipline in and of itself. More importantly, however, it forms the basis for several general NLP algorithms. We begin by examining the Karush-Kuhn-Tucker conditions for the QP and discovering that they turn out to be a set of linear equalities and complementary constraints. Much as for the separable programming problem, a modified version of the simplex algorithm can be used to find solutions.
82
Problem Statement
The general quadratic program can be written as
Minimize f(x) = cx + (1/2) xᵀQx
subject to Ax ≤ b, x ≥ 0
where c is an n-dimensional row vector describing the coefficients of the linear terms in the objective function, and Q is an (n × n) symmetric matrix describing the coefficients of the quadratic terms. If a constant term exists, it is dropped from the model.
As in linear programming, the decision variables are denoted by the n-dimensional column vector x, and the constraints are defined by an (m × n) matrix A and an m-dimensional column vector b of right-hand-side coefficients. We assume that a feasible solution exists and that the constraint region is bounded.
83
Karush-Kuhn-Tucker Conditions We now adapt the first-order necessary conditions given in
Section 10.3 to the quadratic program. These conditions are sufficient for a global minimum when Q is positive definite ; otherwise, the most we can say is that they are necessary.
Excluding the nonnegativity conditions, the Lagrangian function for the quadratic program is

L(x, μ) = cx + (1/2) xᵀQx + μ(Ax − b)
84
where μ is an m-dimensional row vector. The KKT conditions for a local minimum are as follows.
∂L/∂x_j ≥ 0, j = 1, ..., n:  c + xᵀQ + μA ≥ 0  (12a)
∂L/∂μ_i ≤ 0, i = 1, ..., m:  Ax − b ≤ 0  (12b)
x_j(∂L/∂x_j) = 0, j = 1, ..., n:  x(cᵀ + Qx + Aᵀμᵀ) = 0  (12c)
μ_i g_i(x) = 0, i = 1, ..., m:  μ(Ax − b) = 0  (12d)
x_j ≥ 0, j = 1, ..., n  (12e)
μ_i ≥ 0, i = 1, ..., m  (12f)
85
To put Conditions (12a) to (12f) into a more manageable form, we introduce nonnegative surplus variables y to the inequalities in Condition (12a) and nonnegative slack variables v to the inequalities in Condition (12b) to obtain the equations
cᵀ + Qx + Aᵀμᵀ − y = 0  and  Ax − b + v = 0
The KKT conditions can now be written with the constants moved to the right-hand side:
Qx + Aᵀμᵀ − y = −cᵀ  (13a)
Ax + v = b  (13b)
x ≥ 0, μ ≥ 0, y ≥ 0, v ≥ 0  (13c)
yᵀx = 0, μv = 0  (13d)
86
Solving for the Optimal Solution
The simplex algorithm can be used to solve Equations (13a) to (13d) by treating the complementary slackness conditions [Equation (13d)] implicitly with a restricted basis entry rule. The procedure for setting up the LP model follows.
• Let the structural constraints be Equations (13a) and (13b) defined by the KKT conditions.
• If any of the RHS values are negative, multiply the corresponding equation by -1.
• Add an artificial variable to each equation.
• Let the objective function be the sum of the artificial variables.
• Convert the resultant problem into simplex form.
87
Example 14
Solve the following problem.
Minimize f(x) = −8x_1 − 16x_2 + x_1² + 4x_2²
subject to x_1 + x_2 ≤ 5,  x_1 ≤ 3,  x_1 ≥ 0,  x_2 ≥ 0
88
Solution: The data and variable definitions are given below. As we can see, the Q matrix is positive definite, so the KKT conditions are necessary and sufficient for a global optimal solution.
cᵀ = (−8, −16)ᵀ,  Q = [2 0; 0 8],  A = [1 1; 1 0],  b = (5, 3)ᵀ
x = (x_1, x_2)ᵀ,  y = (y_1, y_2)ᵀ,  v = (v_1, v_2)ᵀ,  μ = (μ_1, μ_2)
89
The linear constraints [Equations (13a) and (13b)] take the following form.
2x_1 + μ_1 + μ_2 − y_1 = 8
8x_2 + μ_1 − y_2 = 16
x_1 + x_2 + v_1 = 5
x_1 + v_2 = 3
90
To create the appropriate linear program, we add artificial variables to each constraint and minimize their sum.
Minimize a_1 + a_2 + a_3 + a_4
subject to 2x_1 + μ_1 + μ_2 − y_1 + a_1 = 8
8x_2 + μ_1 − y_2 + a_2 = 16
x_1 + x_2 + v_1 + a_3 = 5
x_1 + v_2 + a_4 = 3
All variables ≥ 0 and subject to the complementarity conditions.
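Rather than reproducing the restricted-basis simplex iterations (the tableau details are omitted from this transcript), we can at least verify numerically that a candidate point satisfies the KKT system (13a) to (13d). The candidate x = (3, 2) with μ = (0, 2) below was obtained by inspecting which constraints are active and is an assumption of this sketch; since Q is positive definite, passing the check certifies it as the global minimum.

```python
# Numerical check of the KKT equations (13a)-(13d) for Example 14.
c = [-8.0, -16.0]
Q = [[2.0, 0.0], [0.0, 8.0]]
A = [[1.0, 1.0], [1.0, 0.0]]
b = [5.0, 3.0]

x = [3.0, 2.0]   # candidate point (assumed, from active-set inspection)
mu = [0.0, 2.0]  # candidate multipliers

# Surplus y from (13a): y = Qx + A^T mu^T + c^T
y = [sum(Q[j][i] * x[i] for i in range(2))
     + sum(A[i][j] * mu[i] for i in range(2)) + c[j] for j in range(2)]
# Slack v from (13b): v = b - Ax
v = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]

assert all(val >= 0 for val in x + mu + y + v)            # (13c)
assert abs(sum(yj * xj for yj, xj in zip(y, x))) < 1e-9   # (13d): y^T x = 0
assert abs(sum(mi * vi for mi, vi in zip(mu, v))) < 1e-9  # (13d): mu v = 0

fval = (sum(cj * xj for cj, xj in zip(c, x))
        + 0.5 * sum(x[i] * Q[i][j] * x[j] for i in range(2) for j in range(2)))
print("x* =", x, "f(x*) =", fval)  # f(x*) = -31.0
```

Both constraints are active at this point (v = 0), the first multiplier is zero, and all complementarity products vanish, so the phase-1 LP would terminate with all artificial variables at zero here.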
91
10.6 ONE-DIMENSIONAL SEARCH METHODS
The basic approach to solving almost all mathematical programs in continuous variables is to select an initial point x⁰ and a direction d⁰ in which the objective function is improving, and then move in that direction until either an extremum is reached or a constraint is violated. In either case, a new direction is computed and the process is repeated. A check for convergence is made at the end of each iteration. At the heart of this approach is a one-dimensional search by which the length of the move, called the step size, is determined. That is, given a point xk and a direction dk at iteration k, the aim is to find an optimal step size tk that moves us to the next point xk+1 = xk + tk dk.
92
Unimodal Functions
Out of practical considerations, we define an interval of uncertainty [a, b] in which the minimum of f(x) must lie. This leads to the one-dimensional problem
Minimize {f(x): x ∈ [a, b]}  (14)
For simplicity, it will also be assumed that f(x) is continuous and unimodal in the interval [a, b], implying that f(x) has a single minimum x*; that is, for x ∈ [a, b] such that f(x) ≠ f(x*), f is strictly decreasing when x < x* and strictly increasing when x > x*. In the case of a minimization problem, the stronger property of strict convexity implies unimodality, but unimodality does not imply convexity. This fact is illustrated by the unimodal functions shown in Figure 10.9. Each function is both concave and convex in subregions but exhibits only one relative minimum in the entire range.
93
94
During a search procedure, if we could exclude portions of [a, b] that did not contain the minimum, then the interval of uncertainty would be reduced. The following theorem shows that it is possible to obtain a reduction by evaluating two points within the interval.
Theorem 7: Let f be a continuous, unimodal function of a single variable defined over the interval [a, b].
Let X1, X2 ∈ [a, b] be such that X1 < X2.
If f(X1) ≥ f(X2), then f(x) ≥ f(X2) for all x ∈ [a, X1].
If f(X1) ≤ f(X2), then f(x) ≥ f(X1) for all x ∈ [X2, b].
95
Dichotomous Search Method
Under the restriction that we may evaluate f(x) only at selected points, our goal is to find a technique that will provide either the minimum or a specified interval of uncertainty after a certain number n of evaluations of the function. The simplest method of doing this is known as the dichotomous search method.
Without loss of generality, we restrict our attention to Problem (14). Let the unknown location of the minimum value be denoted by x*.
96
The dichotomous search method requires a specification of the minimal distance ε > 0 between two points X1 and X2
such that one can still be distinguished from the other. The first two measurements are made at ε on either side of the center of the interval [a, b], as shown in Figure 10.11.
X1 = 0.5(a + b - ε) and X2 = 0.5(a + b + ε)
97
On evaluating the function at these points, Theorem 7 allows us to draw one of three conclusions.
• If f(X1) < f(X2), x* must be located between a and X2. This indicates that the value of b should be updated by setting b to X2.
• If f(X2) < f(X1), x* must be located between X1 and b. This indicates that the value of a should be updated by setting a to X1.
• If f(X1) = f(X2), x* must be located between X1 and X2. This indicates that both endpoints should be updated by setting a to X1 and b to X2.
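The three update rules above translate directly into a loop. The test function and interval below are stand-ins (the chapter's numerical example is not reproduced in this transcript), so treat them as assumptions.

```python
# Dichotomous search on a unimodal stand-in function.
def dichotomous_search(f, a, b, eps=1e-4, tol=1e-3):
    """Shrink [a, b] until its width is below tol; eps is the minimal
    distance at which the two trial points can still be distinguished."""
    while b - a > tol:
        x1 = 0.5 * (a + b - eps)
        x2 = 0.5 * (a + b + eps)
        if f(x1) < f(x2):
            b = x2           # minimum lies between a and x2
        elif f(x2) < f(x1):
            a = x1           # minimum lies between x1 and b
        else:
            a, b = x1, x2    # minimum lies between x1 and x2
    return 0.5 * (a + b)

xmin = dichotomous_search(lambda x: (x - 2.0) ** 2 + 14.0, 1.0, 3.0)
print(xmin)  # close to 2.0
```

Each iteration costs two new function evaluations and cuts the interval roughly in half (plus ε/2), so ε must be chosen smaller than the target tolerance.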
98
99
Golden Section Search Method
In the preceding approach, both function evaluations at each iteration were new. Suppose instead that at each iteration after the first we use a combination of one new evaluation and one old evaluation. This should result in a significant reduction of computational effort if comparable results can be achieved. One method of implementing this approach was inspired by a number commonly observed in nature. In the architecture of ancient Greece, for example, a method of dividing a distance measured from point a to point b at a point c was called a golden section if

(c − a)/(b − a) = (b − c)/(c − a)
100
Dividing the numerators and denominators of each term by b − a and letting γ = (c − a)/(b − a) yields
γ = (1 − γ)/γ
where γ is known as the golden section ratio. Solving for γ is equivalent to solving the quadratic equation γ² + γ − 1 = 0, whose positive root is γ = (√5 − 1)/2 ≈ 0.618. The negative root would imply a negative ratio, which has no meaning from a geometric point of view.
101
We now use the concept of the golden section to develop what is called the golden section search method. This method requires that the ratio of the new interval of uncertainty to the preceding one always be the same. This can be achieved only if the constant of proportionality is the golden section ratio γ.
To implement the algorithm, we begin with the initial interval [a, b] and place the first two search points symmetrically at
X1= a + (1 – γ)(b - a) = b -γ(b - a) and X2= a + γ(b - a) (16)
as illustrated in Figure 10.13.
By construction, we have X1 − a = b − X2, which is maintained throughout the computations.
102
For successive iterations, we determine the interval containing the minimal value of f(x), just as we did in the dichotomous search method. The next step of the golden section method, however, requires only one new evaluation of f(x) with x located at the new golden section point of the new interval of uncertainty. At the end of each iteration, one of the following two cases arises (see Figure 10.13).
• Case 1: If f(X1) > f(X2), the left endpoint a is updated by setting a to X1, and the new X1 is set equal to the old X2. A new X2 is computed from Equation (16).
• Case 2: If f(X1) ≤ f(X2), the right endpoint b is updated by setting b to X2, and the new X2 is set equal to the old X1. A new X1 is computed from Equation (16).
103
We stop when b − a < ε, an arbitrarily small number. At termination, one point remains in the final interval, either X1 or X2. The solution is taken as that point.
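The two cases above can be coded so that each iteration after the first costs only one new function evaluation. The test function and interval are stand-ins for the chapter's example, which is not reproduced here.

```python
import math

# Golden section search; gamma = (sqrt(5) - 1) / 2 ~ 0.618.
GAMMA = (math.sqrt(5.0) - 1.0) / 2.0

def golden_section(f, a, b, tol=1e-5):
    x1 = b - GAMMA * (b - a)         # Equation (16)
    x2 = a + GAMMA * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 > f2:                  # Case 1: minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2   # old x2 becomes the new x1
            x2 = a + GAMMA * (b - a)
            f2 = f(x2)               # the single new evaluation
        else:                        # Case 2: minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1   # old x1 becomes the new x2
            x1 = b - GAMMA * (b - a)
            f1 = f(x1)               # the single new evaluation
    return 0.5 * (a + b)

xmin = golden_section(lambda x: (x - 2.0) ** 2 + 14.0, 1.0, 3.0)
print(xmin)  # close to 2.0
```

The reuse of the old point works because γ² = 1 − γ, so the surviving point already sits at the golden section of the reduced interval.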
It can be shown that after k evaluations, the interval of uncertainty, call it dk, has width
dk = γ^(k−1) d1  (17)
where d1 = b − a (initial width). From this it follows that
dk+1/dk = γ ≈ 0.618  (18)
104
Table 10.9 provides the results for the same example used to illustrate the dichotomous search method. From the table we see that after 12 function evaluations (11 iterations) the minimum point found is X2 = 2.082 with f = 14.189996. The true optimal solution is guaranteed to lie in the range [2.0782, 2.0882]. The width of this interval is 0.01 unit, which is less than one-fourth of the interval yielded by the dichotomous search method with the same number of evaluations. Equation (17) indicates that the interval of uncertainty after 12 evaluations is similarly 0.01 unit. The reader can verify that successive ratios are all (approximately) equal to γ, as specified by Equation (18). For example, for k = 7 we have at the completion of iteration 6 the ratio d6/d5 = (2.1246 − 1.9443)/(2.2361 − 1.9443) = 0.61789 ≈ γ, with the error attributable to rounding.
105
106
107
Newton's Method
When more information than just the value of the function can be computed at each iteration, convergence is likely to be accelerated. Suppose that f(x) is unimodal and twice continuously differentiable. In approaching Problem (14), also suppose that at a point Xk where a measurement is made, it is possible to determine the following three values: f(Xk), f'(Xk), and f''(Xk). This means that it is possible to construct a quadratic function q(x) that agrees with f(x) up to second derivatives at Xk. Let

q(x) = f(Xk) + f'(Xk)(x − Xk) + (1/2) f''(Xk)(x − Xk)²
108
As shown in Figure 10.14a, we may then calculate an estimate Xk+1 of the minimum point of f by finding the point at which the derivative of q vanishes. Thus, setting
0 = q'(Xk+1) = f'(Xk) + f''(Xk)(Xk+1 − Xk)
we find
Xk+1 = Xk − f'(Xk)/f''(Xk)  (19)
which, incidentally, does not depend on f(Xk). This process can then be repeated until some convergence criterion is met, typically |Xk+1 − Xk| < ε or |f'(Xk)| < ε, where ε is some small number.
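Equation (19) is a short loop in practice. The stand-in objective f(x) = x⁴ − 3x below is an assumption chosen because its derivatives are easy to write down; its minimizer is the root of f'(x) = 4x³ − 3.

```python
# Newton's method for the line search, Equation (19):
# x_{k+1} = x_k - f'(x_k) / f''(x_k).
def newton(fp, fpp, x, eps=1e-10, max_iter=50):
    for _ in range(max_iter):
        step = fp(x) / fpp(x)
        x -= step
        if abs(step) < eps:        # |x_{k+1} - x_k| < eps stopping rule
            break
    return x

# Stand-in: f(x) = x**4 - 3x, so f'(x) = 4x**3 - 3 and f''(x) = 12x**2.
xstar = newton(lambda x: 4 * x ** 3 - 3, lambda x: 12 * x ** 2, x=1.0)
print(xstar)  # the root of f', i.e. (3/4)**(1/3)
```

Note that, as the text observes, the function value f(Xk) itself never appears; only the first and second derivatives drive the iteration.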
109
Newton's method can more simply be viewed as a technique for iteratively solving equations of the form φ(x) = 0, where φ(x) = f'(x) when applied to the line search problem. In this notation, we have Xk+1 = Xk − φ(Xk)/φ'(Xk). Figure 10.14b geometrically depicts how the new point is found. The following theorem gives sufficient conditions under which the method will converge to a stationary point.
Theorem 8: Consider the function f(x) with continuous first and second derivatives f'(x) and f”(x). Define φ(x) =f’(x) and φ’(x) =f"(x) and let x* satisfy φ(x*) = 0, φ'(x*)≠0. Then, if X1 is sufficiently close to x*, the sequence generated by
Newton's method [Equation (19)] converges to x* with an order of convergence of at least 2.
110
The phrase "convergence of order ρ" will be defined presently, but for now it means that when the iterate Xk is in the neighborhood of x*, the distance from x* at the next iteration is reduced by the ρth power. Mathematically, this can be stated as ‖Xk+1 − x*‖ ≤ β‖Xk − x*‖^ρ, where β < ∞ is some constant. The larger the order ρ, the faster the convergence.
When second-derivative information is not available, it is possible to use first-order information to estimate f''(Xk) in the quadratic q(x). By letting f''(Xk) ≈ [f'(Xk−1) − f'(Xk)] / (Xk−1 − Xk), the equivalent of Equation (19) is
111
112
Xk+1 = Xk − f'(Xk)(Xk−1 − Xk) / [f'(Xk−1) − f'(Xk)]
which gives rise to what is called the method of false position. Comparing this formula with that of Newton's method [Equation (19)], we see again that the value f(Xk) does not enter.
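The update above needs two starting points but no second derivatives. The sketch below applies it to the same stand-in derivative f'(x) = 4x³ − 3 used for Newton's method; the starting points are assumptions.

```python
# Method of false position (secant iteration applied to phi = f').
def false_position(fp, x_prev, x, eps=1e-10, max_iter=100):
    for _ in range(max_iter):
        denom = fp(x_prev) - fp(x)
        if denom == 0:                 # derivative difference vanished; stop
            break
        x_new = x - fp(x) * (x_prev - x) / denom
        x_prev, x = x, x_new
        if abs(x - x_prev) < eps:
            break
    return x

xstar = false_position(lambda x: 4 * x ** 3 - 3, 0.5, 1.5)
print(xstar)  # approaches the root of f'(x) = 4x**3 - 3
```

The price for avoiding f'' is a slightly lower order of convergence than Newton's method (superlinear rather than quadratic), at one derivative evaluation per iteration.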
113
General Descent Algorithm
The general descent algorithm starts at an arbitrary point, x° and proceeds for some distance in a direction that improves (decreases) the objective function. Arriving at a point that has a smaller objective value than x°, the process finds a new improving direction and moves in that direction to a new point with a still smaller objective. In theory, the process could continue until there are no improving directions, at which point the algorithm would report a local minimum. In practice, the process stops when one or more numerical convergence criteria are satisfied. The algorithm is stated more formally below.
114
• 1. Start with an initial point x⁰. Set the iteration counter k to 0.
• 2. Choose a descent direction dk.
• 3. Perform a line search to choose a step size tk such that wk(tk) < wk(0), where wk(t) = f(xk + t dk).
• 4. Set xk+1 = xk + tk dk.
• 5. Evaluate convergence criteria. If satisfied, stop; otherwise, increase k by 1 and go to Step 2.
An exact line search is one that chooses tk as the first local minimum of wk(t) at Step 3 (i.e., the one with the smallest t value). Finding this minimum to high accuracy is overly time consuming, so modern NLP codes use a variety of inexact line search techniques, often involving polynomial fits, as in the method of false position.
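The five steps can be sketched for the steepest-descent choice dk = −∇f(xk). The backtracking (Armijo-style) line search used at Step 3 and the quadratic test problem are assumptions of this sketch, not the text's example; the text only requires that Step 3 produce a decrease.

```python
# General descent algorithm: steepest descent with a backtracking line search.
def descent(f, grad, x, tol=1e-8, max_iter=500):
    for _ in range(max_iter):
        g = grad(x)                                   # Step 2: d_k = -grad f(x_k)
        if max(abs(gi) for gi in g) < tol:            # Step 5: convergence test
            break
        fx = f(x)
        gnorm2 = sum(gi * gi for gi in g)
        t = 1.0                                       # Step 3: inexact line search,
        while f([xi - t * gi for xi, gi in zip(x, g)]) > fx - 0.1 * t * gnorm2:
            t *= 0.5                                  # halve until sufficient decrease
        x = [xi - t * gi for xi, gi in zip(x, g)]     # Step 4: x_{k+1} = x_k + t_k d_k
    return x

quad = lambda x: (x[0] - 1.0) ** 2 + 4.0 * (x[1] + 2.0) ** 2
quad_grad = lambda x: [2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)]
res = descent(quad, quad_grad, [0.0, 0.0])
print(res)  # converges to (1, -2)
```

The sufficient-decrease test (rather than accepting any decrease) is what keeps the inexact search from stalling on ill-conditioned problems.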
115
Application to a Quadratic in Two Dimensions
For purposes of illustration, let us consider the problem of minimizing a two-dimensional quadratic function

f(x) = cx + (1/2) xᵀQx = c1x1 + c2x2 + (1/2)(q11x1² + 2q12x1x2 + q22x2²)

The gradient of f(x) is

∇f(x) = c + Qx = (c1 + q11x1 + q12x2, c2 + q12x1 + q22x2)ᵀ = (∂f/∂x1, ∂f/∂x2)ᵀ

Thus, starting from the initial point x⁰, we must solve Problem (21) over the line

x(t) = x⁰ − t∇f(x⁰) = (x1⁰ − t ∂f(x⁰)/∂x1, x2⁰ − t ∂f(x⁰)/∂x2)ᵀ
116
to find the new point. The optimal step size, call it t*, can be determined by substituting the right-hand side of the expression above into f(x) and finding the value of t that minimizes f(x(t)). For this simple case, it can be shown with some algebra that

t* = [(f1⁰)² + (f2⁰)²] / [q11(f1⁰)² + 2q12 f1⁰ f2⁰ + q22(f2⁰)²]

where fj⁰ denotes ∂f(x⁰)/∂xj.
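The closed-form step size above is easy to check numerically: evaluate the objective along the steepest-descent line and confirm that t* beats nearby step sizes. The coefficient values and starting point below are illustrative assumptions.

```python
# Check of the closed-form steepest-descent step for a 2-D quadratic:
# t* = (f1^2 + f2^2) / (q11 f1^2 + 2 q12 f1 f2 + q22 f2^2),
# where (f1, f2) is the gradient at x0.
q11, q12, q22 = 2.0, 1.0, 3.0        # assumed Q entries (positive definite)
c1, c2 = -4.0, -6.0                  # assumed linear coefficients
x0 = (0.0, 0.0)                      # assumed starting point

g1 = c1 + q11 * x0[0] + q12 * x0[1]  # partial f / partial x1 at x0
g2 = c2 + q12 * x0[0] + q22 * x0[1]  # partial f / partial x2 at x0

t_star = (g1 ** 2 + g2 ** 2) / (q11 * g1 ** 2 + 2 * q12 * g1 * g2 + q22 * g2 ** 2)

def f(x1, x2):
    return c1 * x1 + c2 * x2 + 0.5 * (q11 * x1 ** 2 + 2 * q12 * x1 * x2
                                      + q22 * x2 ** 2)

def phi(t):                          # the objective along x(t) = x0 - t * grad
    return f(x0[0] - t * g1, x0[1] - t * g2)

print(t_star, phi(t_star))           # phi is minimized at t_star
```

In matrix form this is t* = ∇fᵀ∇f / (∇fᵀQ∇f), which follows from setting the derivative of the one-dimensional quadratic φ(t) to zero.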