Quasi-Newton Methods for Nonlinear Optimization

Introduction

Introduction to Quasi-Newton Methods

  • Family of optimization algorithms for nonlinear problems

  • Alternative to Newton’s method when Hessian is expensive to compute

  • Approximate the Hessian (or its inverse) using gradient information

  • Key idea: Build curvature information from observed behavior of objective function

  • Particularly useful for medium-sized problems (100-1000 variables)

Motivation

Why Quasi-Newton Methods?

  • Newton’s method: Fast convergence but expensive Hessian computation

  • Gradient descent: Cheap per iteration but slow convergence

  • Quasi-Newton: Balance between speed and computational cost

Key Insight

Use gradient information from multiple iterations to build an approximation of the Hessian matrix without computing second derivatives explicitly.

Comparison with Other Methods

Comparison of optimization methods:

  Method             Requires Hessian?   Storage       Convergence
  Newton’s           Yes                 \(O(n^2)\)    Quadratic
  Gradient Descent   No                  \(O(n)\)      Linear
  Quasi-Newton       No                  \(O(n^2)\)    Superlinear

  • Quasi-Newton methods offer a balance between computational cost and convergence rate

  • Avoid explicit calculation of second derivatives

  • Maintain superlinear convergence (better than gradient descent)

Mathematical Foundation

The Secant Equation

Secant Condition

For a good Hessian approximation \(B_{k+1}\), we require:

\[B_{k+1} s_k = y_k\]
where:
\[\begin{aligned} s_k &= x_{k+1} - x_k \quad \text{(step vector)} \\ y_k &= \nabla f(x_{k+1}) - \nabla f(x_k) \quad \text{(gradient change)} \end{aligned}\]

  • This mimics the exact relationship \(y_k = \left( \int_0^1 \nabla^2 f(x_k + \tau s_k)\, d\tau \right) s_k\): the average Hessian along the step maps \(s_k\) to \(y_k\)

  • Ensures the approximation captures local curvature

  • All quasi-Newton methods satisfy this condition
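
As a quick sanity check of the secant condition, consider a quadratic \(f(x) = \tfrac{1}{2} x^T A x - b^T x\): its gradient is \(Ax - b\), so \(y_k = A s_k\) exactly and the true Hessian satisfies the condition. A minimal NumPy sketch (the test matrices here are illustrative):

```python
import numpy as np

# Quadratic test problem f(x) = 0.5 x^T A x - b^T x with gradient A x - b.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)        # symmetric positive definite "Hessian"
b = rng.standard_normal(5)
grad = lambda x: A @ x - b

x_k = rng.standard_normal(5)
x_k1 = rng.standard_normal(5)
s_k = x_k1 - x_k                   # step vector
y_k = grad(x_k1) - grad(x_k)       # gradient change

# For a quadratic, the exact Hessian satisfies the secant equation A s_k = y_k.
print(np.allclose(A @ s_k, y_k))   # True
```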

Positive Definiteness

Curvature Condition

For convergence, we need \(s_k^T y_k > 0\)

  • Ensures that \(B_k\) remains positive definite

  • Guarantees descent direction: \(p_k^T \nabla f_k < 0\)

  • Automatically satisfied for strictly convex functions

  • May require damping for non-convex problems

Wolfe Conditions

Line search must satisfy, for constants \(0 < c_1 < c_2 < 1\):

\[\begin{aligned} f(x_k + \alpha_k p_k) &\leq f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \\ \nabla f(x_k + \alpha_k p_k)^T p_k &\geq c_2 \nabla f_k^T p_k \end{aligned}\]
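
A minimal sketch of checking these two inequalities for a candidate step length (the constants \(c_1 = 10^{-4}\), \(c_2 = 0.9\) are common textbook defaults; the function names are illustrative, not part of the slides):

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step length alpha along direction p."""
    fx, gx = f(x), grad(x)
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= fx + c1 * alpha * (gx @ p)   # Armijo condition
    curvature = grad(x_new) @ p >= c2 * (gx @ p)                   # curvature condition
    return sufficient_decrease and curvature

# Example: a full step along steepest descent on f(x) = 0.5 ||x||^2.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x0 = np.array([2.0, -1.0])
print(satisfies_wolfe(f, grad, x0, -grad(x0), alpha=1.0))          # True
```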

BFGS Method

Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method

Most popular quasi-Newton algorithm:

  • Named after its four independent discoverers (1970)

  • Approximates the inverse Hessian \(H_k \approx [\nabla^2 f(x_k)]^{-1}\)

  • Update formula:

    \[H_{k+1} = (I - \rho_k s_k y_k^T)H_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T\]
    where:
    \[\begin{aligned} s_k &= x_{k+1} - x_k \\ y_k &= \nabla f_{k+1} - \nabla f_k \\ \rho_k &= \frac{1}{y_k^T s_k} \end{aligned}\]
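
A direct NumPy transcription of this update (a sketch; the helper name is ours, not from the slides):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H, given s = x_{k+1} - x_k
    and y = grad f_{k+1} - grad f_k with y^T s > 0."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)

# The result satisfies the (inverse-form) secant equation H_{k+1} y = s.
rng = np.random.default_rng(1)
s, y = rng.standard_normal(3), rng.standard_normal(3)
if y @ s > 0:                      # curvature condition; skip the update otherwise
    print(np.allclose(bfgs_update(np.eye(3), s, y) @ y, s))   # True
```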

BFGS Algorithm

Algorithm: Quasi-Newton Method (BFGS)

  1. Choose initial point \( x_0 \) and \( H_0 = I \)
  2. For \( k = 0, 1, 2, \ldots \):
    1. Compute search direction: \( p_k = -H_k \nabla f_k \)
    2. Perform line search to find \( \alpha_k \)
    3. Update position: \( x_{k+1} = x_k + \alpha_k p_k \)
    4. Compute \( s_k = x_{k+1} - x_k \) and \( y_k = \nabla f_{k+1} - \nabla f_k \)
    5. If \( y_k^T s_k > 0 \):
      • Compute \( \rho_k = \frac{1}{y_k^T s_k} \)
      • Update \( H_{k+1} \) using BFGS formula
    6. Check convergence: if \( \| \nabla f_{k+1} \| < \epsilon \), stop
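
Putting these steps together, a self-contained sketch of the loop (using a simple backtracking line search instead of a full Wolfe search; tolerances and constants are illustrative):

```python
import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-6, max_iter=500):
    """Minimal BFGS iteration following the steps above (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                          # H_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:             # convergence check
            break
        p = -H @ g                              # search direction p_k = -H_k grad f_k
        # Backtracking (Armijo) line search; a Wolfe search is preferable in practice.
        alpha, c1 = 1.0, 1e-4
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:                       # skip the update if curvature fails
            rho = 1.0 / (y @ s)
            I = np.eye(len(x))
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: minimize the two-dimensional Rosenbrock function.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(bfgs_minimize(f, grad, np.array([-1.2, 1.0])))   # approximately [1. 1.]
```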

BFGS Properties

Theoretical Properties

  • Maintains positive definiteness if \(H_0\) is positive definite

  • Satisfies the secant equation: \(H_{k+1} y_k = s_k\)

  • Has superlinear convergence for convex problems

  • Self-correcting: errors in \(H_k\) are eventually corrected

Practical Advantages

  • Robust and reliable

  • Good performance on a wide range of problems

  • Handles ill-conditioning better than gradient descent

  • Limited memory variant (L-BFGS) available for large problems

DFP Method

Davidon-Fletcher-Powell (DFP) Method

  • The first quasi-Newton method (discovered 1959, published 1963)

  • Approximates the Hessian \(B_k \approx \nabla^2 f(x_k)\) rather than its inverse

  • Update formula (with \(\rho_k = 1 / y_k^T s_k\), as before):

    \[B_{k+1} = (I - \rho_k y_k s_k^T) B_k (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T\]

  • Dual to BFGS: interchanging \(s_k \leftrightarrow y_k\) (and \(B_k \leftrightarrow H_k\)) turns this update into the BFGS update of the inverse Hessian

  • Less used today than BFGS, but historically important
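
For comparison with the BFGS snippet, a sketch of the DFP update in the same NumPy style (the helper name is ours):

```python
import numpy as np

def dfp_update(B, s, y):
    """DFP update of the Hessian approximation B (requires y^T s > 0)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    # Same structure as the BFGS inverse-Hessian update, with s and y interchanged.
    return (I - rho * np.outer(y, s)) @ B @ (I - rho * np.outer(s, y)) \
        + rho * np.outer(y, y)
# By construction the result satisfies the secant equation B_{k+1} s = y.
```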

DFP vs BFGS

BFGS generally outperforms DFP in practice, especially for inexact line searches

Other Quasi-Newton Methods

Broyden’s Method

  • Uses rank-one updates instead of rank-two

  • Update formula:

    \[B_{k+1} = B_k + \frac{(y_k - B_k s_k)s_k^T}{s_k^T s_k}\]

  • Simpler but may lose positive definiteness

  • Mainly used for systems of nonlinear equations
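
As a sketch, the rank-one update itself is a one-liner in NumPy (the function name is ours; in practice it sits inside a root-finding loop for \(F(x) = 0\)):

```python
import numpy as np

def broyden_update(B, s, y):
    """Broyden rank-one update of the Jacobian approximation B."""
    # Satisfies B_{k+1} s = y, but does not preserve symmetry or positive definiteness.
    return B + np.outer(y - B @ s, s) / (s @ s)
```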

Limited Memory BFGS (L-BFGS)

Motivation

BFGS requires \(O(n^2)\) storage, prohibitive for large \(n\)

  • Stores only the last \(m\) pairs \((s_i, y_i)\) instead of full matrix

  • Typical choice: \(m = 3\) to \(20\)

  • Storage requirement: \(O(mn)\) instead of \(O(n^2)\)

  • Particularly effective for large-scale optimization

  • Used in machine learning and deep learning applications

Two-Loop Recursion

Efficient algorithm to compute \(H_k \nabla f_k\) without storing \(H_k\) explicitly
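
A sketch of the standard two-loop recursion (following the description in Nocedal & Wright): it returns \(H_k \nabla f_k\) using only the stored \((s_i, y_i)\) pairs, with the implicit initial matrix \(H_0 = \gamma I\) using the scaling discussed under Initialization below:

```python
import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """L-BFGS two-loop recursion: compute H_k @ grad_k from the stored (s, y) pairs."""
    q = grad_k.astype(float).copy()
    alphas = []
    # First loop: most recent pair to oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((rho, a))
    # Implicit initial matrix H_0 = gamma * I with the standard scaling.
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to most recent.
    for (s, y), (rho, a) in zip(zip(s_list, y_list), reversed(alphas)):
        beta = rho * (y @ r)
        r += (a - beta) * s
    return r            # the search direction is then p_k = -r
```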

Update Formulas

Rank-One and Rank-Two Updates

General Update Formula

Most quasi-Newton methods use updates of the form:

\[B_{k+1} = B_k + \text{low-rank matrix}\]

  • Rank-one update: Adds a matrix of rank 1 (e.g., outer product of vectors)

    \[B_{k+1} = B_k + \alpha v v^T\]

  • Rank-two update: Adds a matrix of rank 2 (e.g., sum of two outer products)

    \[B_{k+1} = B_k + \alpha v v^T + \beta w w^T\]

  • Both BFGS and DFP use rank-two updates

  • Rank-one updates exist but may not preserve positive definiteness
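
The low-rank structure is easy to verify numerically: the BFGS correction \(H_{k+1} - H_k\) generically has rank two (a small sketch reusing the update formula from the BFGS section):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
H = np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s < 0:
    s = -s                                     # enforce the curvature condition s^T y > 0
rho = 1.0 / (y @ s)
I = np.eye(n)
H_new = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
print(np.linalg.matrix_rank(H_new - H))        # 2: a rank-two correction
```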

Convergence Analysis

Convergence Properties

Convergence Rate

  • Superlinear convergence: \(\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0\) (faster than any linear rate)

  • For strictly convex quadratic functions with exact line search: termination in at most \(n\) steps

  • Better than linear convergence of gradient descent

  • Slower than quadratic convergence of Newton’s method

Convergence Conditions

  • Objective function must be twice continuously differentiable

  • Hessian must be positive definite at the optimum

  • Line search must satisfy Wolfe conditions

  • Starting point must be sufficiently close to optimum

Practical Considerations

Implementation Details

Line Search Strategy

  • A Wolfe line search enforces the curvature condition and hence \(s_k^T y_k > 0\); strong Wolfe conditions are standard in practice

  • Backtracking (Armijo-only) line search is also common, but it does not guarantee the curvature condition, so updates may need to be skipped or damped

  • Exact line search not necessary (and often expensive)

Initialization

  • Usually start with \(H_0 = I\) or \(H_0 = \gamma I\)

  • Can use scaling: \(\gamma = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}\)

  • Good initialization can improve early convergence

Common Issues and Solutions

Numerical Issues

  • Loss of positive definiteness: Skip update or use damping

  • Ill-conditioning: Regular restarts or modified updates

  • Stagnation: Check for degenerate cases

Performance Tips

  • Use L-BFGS for large problems (\(n > 1000\))

  • Consider preconditioning for ill-conditioned problems

  • Monitor condition number of Hessian approximation

  • Implement efficient matrix operations

Comparison

Comparison with Newton’s Method

  • Newton’s Method:

    • Requires exact Hessian \(\nabla^2 f(x_k)\)

    • Needs \(O(n^3)\) operations per iteration to solve the Newton system

    • The exact Hessian may be indefinite away from the optimum, so steps may fail to be descent directions

    • Quadratic convergence near optimum

  • Quasi-Newton Methods:

    • Build approximation using only gradient information

    • Maintain positive definiteness (for convex problems)

    • \(O(n^2)\) storage for Hessian approximation

    • Superlinear but not quadratic convergence

Comparison with Gradient Descent

  • Gradient Descent:

    • Only uses first-order information

    • \(O(n)\) storage

    • Linear convergence rate

    • May be very slow for ill-conditioned problems

  • Quasi-Newton Methods:

    • Incorporate approximate second-order information

    • Faster convergence (superlinear)

    • Better handling of ill-conditioned problems

    • Higher memory requirements

Applications

Applications and Use Cases

Typical Applications

  • Parameter estimation in statistical models

  • Machine learning optimization (especially L-BFGS)

  • Engineering design optimization

  • Portfolio optimization in finance

  • Nonlinear least squares problems

When to Choose Quasi-Newton

  • Gradient is available but Hessian is expensive

  • Problem size: medium to large (100-10,000 variables)

  • Objective function is smooth and well-behaved

  • Superlinear convergence is desired
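
In practice a library implementation is usually the right starting point. A SciPy usage sketch on the Rosenbrock test function (`rosen` and `rosen_der` ship with `scipy.optimize`):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0, -1.2, 1.0])

# BFGS stores a dense inverse-Hessian approximation (O(n^2) memory).
res_bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")

# L-BFGS-B keeps only a short history of (s, y) pairs, suited to large n.
res_lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

print(res_bfgs.x, res_bfgs.nit)      # solution near [1, 1, 1, 1] and iteration count
print(res_lbfgs.x, res_lbfgs.nit)
```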

Conclusion

Summary

  • Quasi-Newton methods provide efficient optimization for medium-scale problems

  • BFGS is the most popular variant with good theoretical properties

  • DFP is historically important but less used today

  • L-BFGS extends applicability to large-scale problems

  • Rank-two updates maintain positive definiteness and curvature information

  • Offer a practical compromise between Newton’s method and gradient descent

When to Use

  • Problems with expensive Hessian evaluation

  • Problems where \(O(n^2)\) storage is acceptable (or use L-BFGS)

  • When superlinear convergence is desired

  • Smooth, well-conditioned optimization problems

Further Reading

  • Nocedal, J. & Wright, S. J. (2006). Numerical Optimization. Springer.

  • Fletcher, R. (2013). Practical Methods of Optimization. Wiley.

  • Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms. IMA Journal of Applied Mathematics.

  • Liu, D. C. & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming.