Unconstrained Optimization: Methods for Local Minimization
Introduction | Newton's Method | Quasi-Newton Methods | Gauss–Newton Methods | Nonlinear Conjugate Gradient Methods | Principal Axis Method
Introduction
The essence of most methods is in the local quadratic model
that is used to determine the next step. The FindMinimum function in the Wolfram Language has five essentially different ways of choosing this model, controlled by the method option. These methods are similarly used by FindMaximum and FindFit.
"Newton" | use the exact Hessian or a finite difference approximation if the symbolic derivative cannot be computed |
"QuasiNewton" | use the quasi-Newton BFGS approximation to the Hessian built up by updates based on past steps |
"LevenbergMarquardt" | a Gauss–Newton method for least-squares problems; the Hessian is approximated by , where is the Jacobian of the residual function |
"ConjugateGradient" | a nonlinear version of the conjugate gradient method for solving linear systems; a model Hessian is never formed explicitly |
"PrincipalAxis" | works without using any derivatives, not even the gradient, by keeping values from past steps; it requires two starting conditions in each variable |
Basic method choices for FindMinimum.
With Method->Automatic, the Wolfram Language uses the quasi-Newton method unless the problem is structurally a sum of squares, in which case the Levenberg–Marquardt variant of the Gauss–Newton method is used. When given two starting conditions in each variable, the principal axis method is used.
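As a rough sketch of how these choices are specified (using an assumed test function and starting point, not one of the original examples):

f[x_, y_] := (1 - x)^2 + 100 (y - x^2)^2;   (* Rosenbrock-type sum of squares, for illustration only *)
FindMinimum[f[x, y], {{x, -1.2}, {y, 1}}]                            (* Automatic: sum-of-squares structure detected, so Levenberg-Marquardt is used *)
FindMinimum[f[x, y], {{x, -1.2}, {y, 1}}, Method -> "Newton"]        (* symbolic Hessian computed automatically *)
FindMinimum[f[x, y], {{x, -1.2}, {y, 1}}, Method -> "QuasiNewton"]   (* BFGS approximation built up from past steps *)

Each call returns the minimum value found together with replacement rules for the variables.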
Newton's Method
One significant advantage the Wolfram Language provides is that it can symbolically compute derivatives. This means that when you specify Method->"Newton" and the function is explicitly differentiable, the symbolic derivative will be computed automatically. On the other hand, if the function is not in a form that can be explicitly differentiated, the Wolfram Language will use finite difference approximations to compute the Hessian, using structural information to minimize the number of evaluations required. Alternatively, you can specify a Wolfram Language expression for the Hessian, which will be evaluated with numerical values of the variables.
The derivative of this function cannot be found symbolically since the function has been defined only to evaluate with numerical values of the variables.
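A definition of this general kind (an illustrative stand-in, not the original example) might look like the following; the _?NumberQ patterns restrict evaluation to explicit numerical arguments, so no symbolic derivative can be formed:

g[x_?NumberQ, y_?NumberQ] := Cos[x^2 - 3 y] + Sin[x^2 + y^2];   (* evaluates only for numerical arguments *)
FindMinimum[g[x, y], {{x, 1}, {y, 1}}, Method -> "Newton"]      (* gradient and Hessian approximated by finite differences *)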
When the gradient and Hessian are both computed using finite differences, the error in the Hessian may be quite large, and it may be better to use a different method. In this case, FindMinimum does find the minimum quite accurately, but it cannot verify this with confidence because of the inadequate derivative information. Also, the number of function and gradient evaluations is much greater than in the example where the symbolic derivatives were computed automatically, because extra evaluations are required to approximate the gradient and Hessian.
If it is possible to supply the gradient (or the function is such that it can be computed automatically), the method will typically work much better. You can give the gradient using the Gradient option, which has several ways you can specify derivatives.
If you can provide a program that gives the Hessian, you can provide this also. Because the Hessian is only used by Newton's method, it is given as a method option of "Newton".
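For example (a sketch with an assumed objective; the gradient and Hessian expressions are simply generated with D here, standing in for hand-supplied code):

obj = Exp[x - 1] - x + (x - y)^2;   (* illustrative convex objective with minimum at x == y == 1 *)
grad = D[obj, {{x, y}}];            (* gradient as a list of expressions *)
hess = D[obj, {{x, y}, 2}];         (* Hessian as a matrix of expressions *)
FindMinimum[obj, {{x, 2.}, {y, 2.}}, Gradient -> grad,
 Method -> {"Newton", "Hessian" -> hess}]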
In principle, Newton's method uses the Hessian computed either by evaluating the symbolic derivative or by using finite differences. However, the convergence of the method when this Hessian is used directly depends on the function being convex, in which case the Hessian is always positive definite. It is common for a search to start at a location where this condition is violated, so the algorithm needs to take this possibility into account.
When sufficiently near a local maximum, the Hessian is actually negative definite.
If you simply apply the Newton step formula in cases where the Hessian is not positive definite, it is possible to get a step direction that does not lead to a decrease in the function value.
It is crucial for the convergence of line search methods that the direction be computed using a positive definite quadratic model since the search process and the convergence results derived from it depend on a direction with sufficient descent. See "Line Search Methods". The Wolfram Language modifies the Hessian H by adding a diagonal matrix E with entries large enough so that H + E is positive definite. Such methods are sometimes referred to as modified Newton methods. The modification to H is done during the process of computing a Cholesky decomposition, somewhat along the lines described in [GMW81], both for dense and sparse Hessians. The modification is only done if H is not positive definite. This decomposition method is accessible through LinearSolve if you want to use it independently.
Besides the robustness of the (modified) Newton method, another key aspect is its convergence rate. Once a search is close enough to a local minimum, the convergence is said to be q-quadratic, which means that if x* is the local minimum point, then

||x_{k+1} - x*|| <= C ||x_k - x*||^2

for some constant C > 0.
At machine precision, this does not always make a substantial difference, since it is typical that most of the steps are spent getting near to the local minimum. However, if you want a minimum to extremely high precision, Newton's method is usually the best choice because of the rapid convergence.
When the option WorkingPrecision->prec is used, the default for the AccuracyGoal and PrecisionGoal is prec/2. Thus, this example should find the minimum to at least 50000 digits.
The reason that more function evaluations were required than the number of steps is that the Wolfram Language adaptively increases the precision from the precision of the initial value to the requested maximum WorkingPrecision. The sequence of precisions used is chosen so that as few computations are done at the most expensive final precision as possible, under the assumption that the points are converging to the minimum. Sometimes when the Wolfram Language changes precision, it is necessary to reevaluate the function at the higher precision.
Note that typically the precision is roughly double the scale of the error. For Newton's method this is appropriate since when the step is computed, the scale of the error will effectively double according to the quadratic convergence.
FindMinimum always starts with the precision of the starting values you gave it. Thus, if you do not want it to use adaptive precision control, you can start with values that are exact or have at least the maximum WorkingPrecision.
Even though this may use fewer function evaluations, they are all done at the highest precision, so typically adaptive precision saves a lot of time. For example, the previous command without adaptive precision takes more than 50 times as long as when starting from machine precision.
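A sketch of both workflows, with an assumed one-dimensional objective:

FindMinimum[x^4 - 3 x^2 + x, {x, -1.}, Method -> "Newton",
 WorkingPrecision -> 200]   (* machine-precision start: precision is raised adaptively toward 200 digits *)
FindMinimum[x^4 - 3 x^2 + x, {x, -1}, Method -> "Newton",
 WorkingPrecision -> 200]   (* exact start: every step is carried out at the full working precision *)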
With Newton's method, both line search and trust region step control are implemented. The default, which is used in the preceding examples, is the line search. However, the same examples may be run with the trust region approach. The approach typically requires more numerical linear algebra computations per step, but because steps are better controlled, it may converge in fewer iterations.
You can see from the comparison of the two plots that the trust region method has kept the steps within better control as the search follows the valley and consequently converges with fewer function evaluations.
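A sketch of how the trust region step control might be requested for Newton's method (the objective and starting point are assumptions for illustration):

FindMinimum[(1 - x)^2 + 100 (y - x^2)^2, {{x, -1.2}, {y, 1}},
 Method -> {"Newton", "StepControl" -> "TrustRegion"}]   (* the default is "StepControl" -> "LineSearch" *)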
The following table summarizes the options you can use with Newton's method.
option name | default value | |
"Hessian" | Automatic | an expression to use for computing the Hessian matrix |
"StepControl" | "LineSearch" | how to control steps; options include "LineSearch", "TrustRegion", or None |
Method options for Method->"Newton".
Quasi-Newton Methods
There are many variants of quasi-Newton methods. In all of them, the idea is to base the matrix B_k in the quadratic model

q_k(p) = f(x_k) + grad f(x_k)^T p + (1/2) p^T B_k p    (1)

on an approximation of the Hessian matrix built up from the function and gradient values from some or all steps previously taken.
The first thing to notice about the path taken in this example is that it starts in the wrong direction. This direction is chosen because at the first step all the method has to go by is the gradient, and so it takes the direction of steepest descent. However, in subsequent steps, it incorporates information from the values of the function and gradient at the steps taken to build up an approximate model of the Hessian.
The methods used by the Wolfram Language are the Broyden–Fletcher–Goldfarb–Shanno (BFGS) updates and, for large systems, the limited-memory BFGS (L-BFGS) methods, in which the model is not stored explicitly, but rather is calculated by gradients and step directions stored from past steps.
The BFGS method is implemented such that instead of forming the approximate Hessian B_k at each step, Cholesky factors L_k, such that L_k L_k^T = B_k, are computed so that only O(n^2) operations are needed to solve the system B_k p_k = -grad f(x_k) [DS96] for a problem with n variables.
For large-scale sparse problems, the BFGS method can be problematic because, in general, the Cholesky factors (or the Hessian approximation or its inverse) are dense, so the memory and operation requirements become prohibitive compared to algorithms that take advantage of sparseness. The L-BFGS algorithm [NW99] forms an approximation to the inverse Hessian based on the last m past steps, which are stored. The Hessian approximation may not be as complete, but the memory and order of operations per step are limited to O(m n) for a problem with n variables. In the Wolfram Language, for problems over 250 variables, the algorithm is switched automatically to L-BFGS. You can control this with the method option "StepMemory"->m. With "StepMemory"->Infinity, the full BFGS method will always be used. Choosing an appropriate value of m is a tradeoff between speed of convergence and the work done per step; with a very small m, you are most likely better off using a "conjugate gradient" algorithm.
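A sketch of how "StepMemory" might be set on a larger problem (the convex test function below is an assumption chosen for illustration; with more than 250 variables, the limited-memory variant is selected automatically):

n = 300;
vars = Array[x, n];
obj = Total[(vars - N[Range[n]]/n)^2] + Total[Differences[vars]^2];   (* convex objective in 300 variables *)
First[FindMinimum[obj, Transpose[{vars, ConstantArray[0., n]}],
  Method -> {"QuasiNewton", "StepMemory" -> 25}]]                     (* retain roughly 25 past steps in the update *)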
Quasi-Newton methods are chosen as the default in the Wolfram Language because they are typically quite fast and do not require computation of the Hessian matrix, which can be quite expensive both in terms of the symbolic computation and numerical evaluation. With an adequate line search, they can be shown to converge superlinearly [NW99] to a local minimum where the Hessian is positive definite. This means that

lim_{k->infinity} ||x_{k+1} - x*|| / ||x_k - x*|| = 0

or, in other words, the steps keep getting smaller. However, for very high precision, this does not compare to the q-quadratic convergence rate of Newton's method.
Newton's method is able to find ten times as many digits with far fewer steps because of its quadratic convergence rate. However, the convergence with the quasi-Newton method is still superlinear since the ratio of the errors is clearly going to zero.
The following table summarizes the options you can use with quasi-Newton methods.
option name | default value | |
"StepMemory" | Automatic | the effective number of steps to "remember" in the Hessian approximation; can be a positive integer or Automatic |
"StepControl" | "LineSearch" | how to control steps; can be "LineSearch" or None |
Method options for Method->"QuasiNewton".
Gauss–Newton Methods
For minimization problems for which the objective function is a sum of squares,

f(x) = (1/2) Sum_{i=1}^{m} r_i(x)^2 = (1/2) r(x)·r(x),

it is often advantageous to use the special structure of the problem. Time and effort can be saved by computing the residual function r(x) and its derivative, the Jacobian J(x). The Gauss–Newton method is an elegant way to do this. Rather than using the complete second-order Hessian matrix for the quadratic model, the Gauss–Newton method uses B_k = J_k^T J_k in (1) such that the step p_k is computed from the formula

J_k^T J_k p_k = -J_k^T r_k,

where r_k = r(x_k), J_k = J(x_k), and so on. Note that J^T J is an approximation to the full Hessian, J^T J + Sum_i r_i grad^2 r_i. In the zero residual case, where r(x*) = 0 at the minimum x*, or when r varies nearly as a linear function near the minimum point, the approximation to the Hessian is quite good and the quadratic convergence of Newton's method is commonly observed.
Objective functions that are sums of squares are quite common, and, in fact, this is the form of the objective function when FindFit is used with the default value of the NormFunction option. One way to view the Gauss–Newton method is in terms of least-squares problems: solving the Gauss–Newton step equation is the same as solving a linear least-squares problem, so applying a Gauss–Newton method is in effect applying a sequence of linear least-squares fits to a nonlinear function. With this view, it makes sense that this method is particularly appropriate for the sort of nonlinear fitting that FindFit does.
When the Wolfram Language encounters a problem that is expressly a sum of squares, such as the Rosenbrock example, or a function that is the dot product of a vector with itself, the Gauss–Newton method will be used automatically.
If you compare this with the same example done with Newton's method, you can see that it was done with fewer steps and evaluations because the Gauss–Newton method is taking advantage of the special structure of the problem. The convergence rate near the minimum is just as good as for Newton's method because the residual is zero at the minimum.
The Levenberg–Marquardt method is a Gauss–Newton method with trust region step control (though it was originally proposed before the general notion of trust regions had been developed). You can request this method specifically by using the FindMinimum option Method->"LevenbergMarquardt" or equivalently Method->"GaussNewton".
Sometimes it is awkward to express a function so that it will explicitly be a sum of squares or a dot product of a vector with itself. In these cases, it is possible to use the "Residual" method option to specify the residual directly. Similarly, you can specify the derivative of the residual with the "Jacobian" method option. Note that when the residual is specified through the "Residual" method option, it is not checked for consistency with the first argument of FindMinimum. The values returned will depend on the value given through the option.
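A sketch of supplying the residual directly (the residual vector here is an assumed example, chosen so that the first argument is consistent with it):

res = {10 (y - x^2), 1 - x};                      (* residual vector r(x, y) *)
FindMinimum[(1/2) res.res, {{x, -1.2}, {y, 1}},
 Method -> {"LevenbergMarquardt", "Residual" -> res}]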
option name | default value | |
"Residual" | Automatic | allows you to directly specify the residual such that |
"EvaluationMonitor" | Automatic | an expression that is evaluated each time the residual is evaluated |
"Jacobian" | Automatic | allows you to specify the (matrix) derivative of the residual |
"StepControl" | "TrustRegion" | must be "TrustRegion", but allows you to change control parameters through method options |
Method options for Method->"LevenbergMarquardt".
Another natural way of setting up sums of squares problems in the Wolfram Language is with FindFit, which computes nonlinear fits to data. A simple example follows.
In the example, FindFit internally constructs a residual function and Jacobian, which are in turn used by the Gauss–Newton method to find the minimum of the sum of squares, or the nonlinear least-squares fit. Of course, FindFit can be used with other methods, but because a residual function that evaluates rapidly can be constructed, it is often faster than the other methods.
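A minimal sketch of this workflow with synthetic data generated from an assumed exponential model:

data = Table[{t, 2.5 Exp[-1.3 t] + RandomReal[{-0.02, 0.02}]}, {t, 0., 2., 0.1}];
FindFit[data, a Exp[b t], {{a, 1}, {b, -1}}, t]   (* by default, a residual and Jacobian are constructed for the Gauss-Newton method *)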
Nonlinear Conjugate Gradient Methods
The basis for a nonlinear conjugate gradient method is to effectively apply the linear conjugate gradient method, where the residual is replaced by the gradient. A model quadratic function is never explicitly formed, so it is always combined with a line search method.
The first nonlinear conjugate gradient method was proposed by Fletcher and Reeves as follows. Given a step direction p_k, use the line search to find alpha_k such that x_{k+1} = x_k + alpha_k p_k. Then compute

beta_{k+1} = (grad f(x_{k+1})·grad f(x_{k+1})) / (grad f(x_k)·grad f(x_k)),   p_{k+1} = -grad f(x_{k+1}) + beta_{k+1} p_k.    (2)
It is essential that the line search for choosing satisfies the strong Wolfe conditions; this is necessary to ensure that the directions are descent directions [NW99].
An alternate method, which generally (but not always) works better in practice, is that of Polak and Ribiere, where equation (2) is replaced with

beta_{k+1} = (grad f(x_{k+1})·(grad f(x_{k+1}) - grad f(x_k))) / (grad f(x_k)·grad f(x_k)).    (3)
In formula (3), it is possible that beta_{k+1} can become negative, in which case the Wolfram Language uses the algorithm modified by using beta_{k+1} = max(beta_{k+1}, 0). In the Wolfram Language, the default conjugate gradient method is Polak–Ribiere, but the Fletcher–Reeves method can be chosen by using the method option
Method->{"ConjugateGradient",Method->"FletcherReeves"}.
The advantage of conjugate gradient methods is that they use relatively little memory for large-scale problems and require no numerical linear algebra, so each step is quite fast. The disadvantage is that they typically converge much more slowly than Newton or quasi-Newton methods. Also, steps are typically poorly scaled for length, so the line search algorithm may require more iterations each time to find an acceptable step.
One issue that arises with nonlinear conjugate gradient methods is when to restart them. As the search moves, the nature of the local quadratic approximation to the function may change substantially. The local convergence of the method depends on that of the linear conjugate gradient method, where the quadratic function is constant. With a constant quadratic function of n variables and an exact line search, the linear algorithm will converge in n or fewer iterations. By restarting (taking a steepest descent step with p_{k+1} = -grad f(x_{k+1})) every so often, it is possible to eliminate information from previous points, which may not be relevant to the local quadratic model at the current search point. If you look carefully at the example, you can see where the method was restarted and a steepest descent step was taken. One option is to simply restart after every k iterations, where k <= n. You can specify this using the method option "RestartIterations"->k. An alternative is to restart when consecutive gradients are not sufficiently orthogonal according to the test

|grad f(x_k)·grad f(x_{k+1})| / (grad f(x_{k+1})·grad f(x_{k+1})) >= nu,

with a threshold nu between 0 and 1. You can specify this using the method option "RestartThreshold"->nu.
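A sketch of how the conjugate gradient variant and its restart options might be combined (the function and settings are illustrative):

FindMinimum[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1}, {y, 1}},
 Method -> {"ConjugateGradient", Method -> "FletcherReeves",
   "RestartThreshold" -> 1/5, "RestartIterations" -> 20}]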
The table summarizes the options you can use with the conjugate gradient methods.
option name | default value | |
"Method" | "PolakRibiere" | nonlinear conjugate gradient method can be "PolakRibiere" or "FletcherReeves" |
"RestartThreshold" | 1/10 | threshold ν for gradient orthogonality below which a restart will be done |
"RestartIterations" | ∞ | number of iterations after which to restart |
"StepControl" | "LineSearch" | must be "LineSearch", but you can use this to specify line search methods |
Method options for Method->"ConjugateGradient".
It should be noted that the default method for FindMinimum in version 4 was a conjugate gradient method with a near exact line search. This has been maintained for legacy reasons and can be accessed by using the FindMinimum option Method->"Gradient". Typically, this will use more function and gradient evaluations than the newer Method->"ConjugateGradient", which itself often uses far more than the methods that the Wolfram Language currently uses as defaults.
Principal Axis Method
Gauss–Newton and conjugate gradient methods use derivatives. When the Wolfram Language cannot compute symbolic derivatives, finite differences will be used. Computing derivatives with finite differences can impose a significant cost in some cases and certainly affects the reliability of derivatives, ultimately having an effect on how good an approximation to the minimum is achievable. For functions where symbolic derivatives are not available, an alternative is to use a derivative-free algorithm, where an approximate model is built up using only values from function evaluations.
The Wolfram Language uses the principal axis method of Brent [Br02] as a derivative-free algorithm. For an n-variable problem, take a set of search directions u_1, u_2, …, u_n and a point x_0. Take x_i to be the point that minimizes the function along the direction u_i from x_{i-1} (i.e. do a line search from x_{i-1} along u_i), then replace u_i with u_{i+1}. At the end, replace u_n with x_n - x_0. Ideally, the new u_i should be linearly independent, so that a new iteration could be undertaken, but in practice, they are not. Brent's algorithm involves using the singular value decomposition (SVD) on the matrix (u_1, u_2, …, u_n) to realign them to the principal directions for the local quadratic model. (An eigen decomposition could be used, but Brent shows that the SVD is more efficient.) With the new set of u_i obtained, another iteration can be done.
Two distinct starting conditions in each variable are required for this method because these are used to define the magnitudes of the vectors u_i. In fact, whenever you specify two starting conditions in each variable, FindMinimum, FindMaximum, and FindFit will use the principal axis algorithm by default.
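A sketch with an illustrative black-box function like the one used earlier; giving two starting values for each variable selects the principal axis method:

h[x_?NumberQ, y_?NumberQ] := Cos[x^2 - 3 y] + Sin[x^2 + y^2];   (* numeric-only definition, no symbolic derivatives *)
FindMinimum[h[x, y], {{x, 1, 2}, {y, 1, 2}}]                    (* two starting conditions per variable: principal axis method *)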
The basics of the search algorithm can be seen quite well from the plot since the derivative-free line search algorithm requires a substantial number of function evaluations. First a line search is done in the first coordinate direction, then, from that point, a line search is done in the second coordinate direction, determining the step direction. Once the step is taken, the vectors u_i are realigned appropriately to the principal directions of the local quadratic approximation, and the next step is similarly computed.
The algorithm is efficient in terms of convergence rate; it has quadratic convergence in terms of steps. However, in terms of function evaluations, it is quite expensive because of the derivative-free line search required. Note that since the directions given to the line search (especially at the beginning) are not necessarily descent directions, the line search has to be able to search in both directions. For problems with many variables, the individual line searches in all directions become very expensive, so this method is typically better suited to problems without too many variables.