Quasi-Newton Methods for Nonlinear Optimization

Introduction

Introduction to Quasi-Newton Methods

  • Family of optimization algorithms for nonlinear problems

  • Alternative to Newton’s method when Hessian is expensive to compute

  • Approximate the Hessian (or its inverse) using gradient information

  • Key idea: Build curvature information from observed behavior of objective function

  • Particularly useful for medium-sized problems (100-1000 variables)

Motivation

Why Quasi-Newton Methods?

  • Newton’s method: Fast convergence but expensive Hessian computation

  • Gradient descent: Cheap per iteration but slow convergence

  • Quasi-Newton: Balance between speed and computational cost

Key Insight

Use gradient information from multiple iterations to build an approximation of the Hessian matrix without computing second derivatives explicitly.

Comparison with Other Methods

Comparison of optimization methods:

  Method             Requires Hessian?   Storage       Convergence
  Newton’s           Yes                 \(O(n^2)\)    Quadratic
  Gradient Descent   No                  \(O(n)\)      Linear
  Quasi-Newton       No                  \(O(n^2)\)    Superlinear

  • Quasi-Newton methods offer a balance between computational cost and convergence rate

  • Avoid explicit calculation of second derivatives

  • Maintain superlinear convergence (better than gradient descent)

Mathematical Foundation

The Secant Equation

Secant Condition

For a good Hessian approximation \(B_{k+1}\), we require:

\[B_{k+1} s_k = y_k\]
where:
\[\begin{aligned} s_k &= x_{k+1} - x_k \quad \text{(step vector)} \\ y_k &= \nabla f(x_{k+1}) - \nabla f(x_k) \quad \text{(gradient change)} \end{aligned}\]

  • This mimics the exact relationship \(y_k = \left( \int_0^1 \nabla^2 f(x_k + \tau s_k)\, d\tau \right) s_k\): the average Hessian along the step maps \(s_k\) to \(y_k\)

  • Ensures the approximation captures local curvature

  • All quasi-Newton methods satisfy this condition
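
As a quick sanity check of the secant condition, consider a quadratic \(f(x) = \tfrac{1}{2} x^T A x - b^T x\): its gradient is \(Ax - b\), so \(y_k = A s_k\) exactly and the true Hessian satisfies the condition. A minimal NumPy sketch (the test matrices here are illustrative):

```python
import numpy as np

# Quadratic test problem f(x) = 0.5 x^T A x - b^T x with gradient A x - b.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)        # symmetric positive definite "Hessian"
b = rng.standard_normal(5)
grad = lambda x: A @ x - b

x_k = rng.standard_normal(5)
x_k1 = rng.standard_normal(5)
s_k = x_k1 - x_k                   # step vector
y_k = grad(x_k1) - grad(x_k)       # gradient change

# For a quadratic, the exact Hessian satisfies the secant equation A s_k = y_k.
print(np.allclose(A @ s_k, y_k))   # True
```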

Positive Definiteness

Curvature Condition

For convergence, we need \(s_k^T y_k > 0\)

  • Ensures that \(B_k\) remains positive definite

  • Guarantees descent direction: \(p_k^T \nabla f_k < 0\)

  • Automatically satisfied for strictly convex functions

  • May require damping for non-convex problems

Wolfe Conditions

Line search must satisfy, for constants \(0 < c_1 < c_2 < 1\):

\[\begin{aligned} f(x_k + \alpha_k p_k) &\leq f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \\ \nabla f(x_k + \alpha_k p_k)^T p_k &\geq c_2 \nabla f_k^T p_k \end{aligned}\]
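
A minimal sketch of checking these two inequalities for a candidate step length (the constants \(c_1 = 10^{-4}\), \(c_2 = 0.9\) are common textbook defaults; the function names are illustrative, not part of the slides):

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step length alpha along direction p."""
    fx, gx = f(x), grad(x)
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= fx + c1 * alpha * (gx @ p)   # Armijo condition
    curvature = grad(x_new) @ p >= c2 * (gx @ p)                   # curvature condition
    return sufficient_decrease and curvature

# Example: a full step along steepest descent on f(x) = 0.5 ||x||^2.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x0 = np.array([2.0, -1.0])
print(satisfies_wolfe(f, grad, x0, -grad(x0), alpha=1.0))          # True
```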

BFGS Method

Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method

Most popular quasi-Newton algorithm:

  • Named after its four independent discoverers (1970)

  • Approximates the inverse Hessian \(H_k \approx [\nabla^2 f(x_k)]^{-1}\)

  • Update formula:

    \[H_{k+1} = (I - \rho_k s_k y_k^T)H_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T\]
    where:
    \[\begin{aligned} s_k &= x_{k+1} - x_k \\ y_k &= \nabla f_{k+1} - \nabla f_k \\ \rho_k &= \frac{1}{y_k^T s_k} \end{aligned}\]
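
A direct NumPy transcription of this update (a sketch; the helper name is ours, not from the slides):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H, given s = x_{k+1} - x_k
    and y = grad f_{k+1} - grad f_k with y^T s > 0."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)

# The result satisfies the (inverse-form) secant equation H_{k+1} y = s.
rng = np.random.default_rng(1)
s, y = rng.standard_normal(3), rng.standard_normal(3)
if y @ s > 0:                      # curvature condition; skip the update otherwise
    print(np.allclose(bfgs_update(np.eye(3), s, y) @ y, s))   # True
```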

BFGS Algorithm

Algorithm: Quasi-Newton Method (BFGS)

  1. Choose initial point \( x_0 \) and \( H_0 = I \)
  2. For \( k = 0, 1, 2, \ldots \):
    1. Compute search direction: \( p_k = -H_k \nabla f_k \)
    2. Perform line search to find \( \alpha_k \)
    3. Update position: \( x_{k+1} = x_k + \alpha_k p_k \)
    4. Compute \( s_k = x_{k+1} - x_k \) and \( y_k = \nabla f_{k+1} - \nabla f_k \)
    5. If \( y_k^T s_k > 0 \):
      • Compute \( \rho_k = \frac{1}{y_k^T s_k} \)
      • Update \( H_{k+1} \) using BFGS formula
    6. Check convergence: if \( \| \nabla f_{k+1} \| < \epsilon \), stop
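
Putting these steps together, a self-contained sketch of the loop (using a simple backtracking line search instead of a full Wolfe search; tolerances and constants are illustrative):

```python
import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-6, max_iter=500):
    """Minimal BFGS iteration following the steps above (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                          # H_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:             # convergence check
            break
        p = -H @ g                              # search direction p_k = -H_k grad f_k
        # Backtracking (Armijo) line search; a Wolfe search is preferable in practice.
        alpha, c1 = 1.0, 1e-4
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:                       # skip the update if curvature fails
            rho = 1.0 / (y @ s)
            I = np.eye(len(x))
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: minimize the two-dimensional Rosenbrock function.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(bfgs_minimize(f, grad, np.array([-1.2, 1.0])))   # approximately [1. 1.]
```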

BFGS Properties

Theoretical Properties

  • Maintains positive definiteness if \(H_0\) is positive definite

  • Satisfies the secant equation: \(H_{k+1} y_k = s_k\)

  • Has superlinear convergence for convex problems

  • Self-correcting: errors in \(H_k\) are eventually corrected

Practical Advantages

  • Robust and reliable

  • Good performance on a wide range of problems

  • Handles ill-conditioning better than gradient descent

  • Limited memory variant (L-BFGS) available for large problems

DFP Method

Davidon-Fletcher-Powell (DFP) Method

  • The first quasi-Newton method (discovered 1959, published 1963)

  • Approximates the Hessian \(B_k \approx \nabla^2 f(x_k)\) rather than its inverse

  • Update formula (with \(\rho_k = 1 / y_k^T s_k\), as before):

    \[B_{k+1} = (I - \rho_k y_k s_k^T) B_k (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T\]

  • Dual to BFGS: interchanging \(s_k \leftrightarrow y_k\) (and \(B_k \leftrightarrow H_k\)) turns this update into the BFGS update of the inverse Hessian

  • Less used today than BFGS, but historically important
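
For comparison with the BFGS snippet, a sketch of the DFP update in the same NumPy style (the helper name is ours):

```python
import numpy as np

def dfp_update(B, s, y):
    """DFP update of the Hessian approximation B (requires y^T s > 0)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    # Same structure as the BFGS inverse-Hessian update, with s and y interchanged.
    return (I - rho * np.outer(y, s)) @ B @ (I - rho * np.outer(s, y)) \
        + rho * np.outer(y, y)
# By construction the result satisfies the secant equation B_{k+1} s = y.
```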

DFP vs BFGS

BFGS generally outperforms DFP in practice, especially for inexact line searches

Other Quasi-Newton Methods

Broyden’s Method

  • Uses rank-one updates instead of rank-two

  • Update formula:

    \[B_{k+1} = B_k + \frac{(y_k - B_k s_k)s_k^T}{s_k^T s_k}\]

  • Simpler but may lose positive definiteness

  • Mainly used for systems of nonlinear equations
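
As a sketch, the rank-one update itself is a one-liner in NumPy (the function name is ours; in practice it sits inside a root-finding loop for \(F(x) = 0\)):

```python
import numpy as np

def broyden_update(B, s, y):
    """Broyden rank-one update of the Jacobian approximation B."""
    # Satisfies B_{k+1} s = y, but does not preserve symmetry or positive definiteness.
    return B + np.outer(y - B @ s, s) / (s @ s)
```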

Limited Memory BFGS (L-BFGS)

Motivation

BFGS requires \(O(n^2)\) storage, prohibitive for large \(n\)

  • Stores only the last \(m\) pairs \((s_i, y_i)\) instead of full matrix

  • Typical choice: \(m = 3\) to \(20\)

  • Storage requirement: \(O(mn)\) instead of \(O(n^2)\)

  • Particularly effective for large-scale optimization

  • Used in machine learning and deep learning applications

Two-Loop Recursion

Efficient algorithm to compute \(H_k \nabla f_k\) without storing \(H_k\) explicitly
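
A sketch of the standard two-loop recursion (following the description in Nocedal & Wright): it returns \(H_k \nabla f_k\) using only the stored \((s_i, y_i)\) pairs, with the implicit initial matrix \(H_0 = \gamma I\) using the scaling discussed under Initialization below:

```python
import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """L-BFGS two-loop recursion: compute H_k @ grad_k from the stored (s, y) pairs."""
    q = grad_k.astype(float).copy()
    alphas = []
    # First loop: most recent pair to oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((rho, a))
    # Implicit initial matrix H_0 = gamma * I with the standard scaling.
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to most recent.
    for (s, y), (rho, a) in zip(zip(s_list, y_list), reversed(alphas)):
        beta = rho * (y @ r)
        r += (a - beta) * s
    return r            # the search direction is then p_k = -r
```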

Update Formulas

Rank-One and Rank-Two Updates

General Update Formula

Most quasi-Newton methods use updates of the form:

\[B_{k+1} = B_k + \text{low-rank matrix}\]

  • Rank-one update: Adds a matrix of rank 1 (e.g., outer product of vectors)

    \[B_{k+1} = B_k + \alpha v v^T\]

  • Rank-two update: Adds a matrix of rank 2 (e.g., sum of two outer products)

    \[B_{k+1} = B_k + \alpha v v^T + \beta w w^T\]

  • Both BFGS and DFP use rank-two updates

  • Rank-one updates exist but may not preserve positive definiteness
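
The low-rank structure is easy to verify numerically: the BFGS correction \(H_{k+1} - H_k\) generically has rank two (a small sketch reusing the update formula from the BFGS section):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
H = np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s < 0:
    s = -s                                     # enforce the curvature condition s^T y > 0
rho = 1.0 / (y @ s)
I = np.eye(n)
H_new = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
print(np.linalg.matrix_rank(H_new - H))        # 2: a rank-two correction
```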

Convergence Analysis

Convergence Properties

Convergence Rate

  • Superlinear convergence: \(\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0\) (faster than any linear rate)

  • For strictly convex quadratic functions with exact line search: termination in at most \(n\) steps

  • Better than linear convergence of gradient descent

  • Slower than quadratic convergence of Newton’s method

Convergence Conditions

  • Objective function must be twice continuously differentiable

  • Hessian must be positive definite at the optimum

  • Line search must satisfy Wolfe conditions

  • Starting point must be sufficiently close to optimum

Practical Considerations

Implementation Details

Line Search Strategy

  • A Wolfe line search enforces the curvature condition and hence \(s_k^T y_k > 0\); strong Wolfe conditions are standard in practice

  • Backtracking (Armijo-only) line search is also common, but it does not guarantee the curvature condition, so updates may need to be skipped or damped

  • Exact line search not necessary (and often expensive)

Initialization

  • Usually start with \(H_0 = I\) or \(H_0 = \gamma I\)

  • Can use scaling: \(\gamma = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}\)

  • Good initialization can improve early convergence

Common Issues and Solutions

Numerical Issues

  • Loss of positive definiteness: Skip update or use damping

  • Ill-conditioning: Regular restarts or modified updates

  • Stagnation: Check for degenerate cases

Performance Tips

  • Use L-BFGS for large problems (\(n > 1000\))

  • Consider preconditioning for ill-conditioned problems

  • Monitor condition number of Hessian approximation

  • Implement efficient matrix operations

Comparison

Comparison with Newton’s Method

  • Newton’s Method:

    • Requires exact Hessian \(\nabla^2 f(x_k)\)

    • Needs \(O(n^3)\) operations per iteration to solve the Newton system

    • The exact Hessian may be indefinite away from the optimum, so steps may fail to be descent directions

    • Quadratic convergence near optimum

  • Quasi-Newton Methods:

    • Build approximation using only gradient information

    • Maintain positive definiteness (for convex problems)

    • \(O(n^2)\) storage for Hessian approximation

    • Superlinear but not quadratic convergence

Comparison with Gradient Descent

  • Gradient Descent:

    • Only uses first-order information

    • \(O(n)\) storage

    • Linear convergence rate

    • May be very slow for ill-conditioned problems

  • Quasi-Newton Methods:

    • Incorporate approximate second-order information

    • Faster convergence (superlinear)

    • Better handling of ill-conditioned problems

    • Higher memory requirements

Applications

Applications and Use Cases

Typical Applications

  • Parameter estimation in statistical models

  • Machine learning optimization (especially L-BFGS)

  • Engineering design optimization

  • Portfolio optimization in finance

  • Nonlinear least squares problems

When to Choose Quasi-Newton

  • Gradient is available but Hessian is expensive

  • Problem size: medium to large (100-10,000 variables)

  • Objective function is smooth and well-behaved

  • Superlinear convergence is desired
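
In practice a library implementation is usually the right starting point. A SciPy usage sketch on the Rosenbrock test function (`rosen` and `rosen_der` ship with `scipy.optimize`):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0, -1.2, 1.0])

# BFGS stores a dense inverse-Hessian approximation (O(n^2) memory).
res_bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")

# L-BFGS-B keeps only a short history of (s, y) pairs, suited to large n.
res_lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

print(res_bfgs.x, res_bfgs.nit)      # solution near [1, 1, 1, 1] and iteration count
print(res_lbfgs.x, res_lbfgs.nit)
```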

Conclusion

Summary

  • Quasi-Newton methods provide efficient optimization for medium-scale problems

  • BFGS is the most popular variant with good theoretical properties

  • DFP is historically important but less used today

  • L-BFGS extends applicability to large-scale problems

  • Rank-two updates maintain positive definiteness and curvature information

  • Offer a practical compromise between Newton’s method and gradient descent

When to Use

  • Problems with expensive Hessian evaluation

  • Problems where \(O(n^2)\) storage is acceptable (or use L-BFGS)

  • When superlinear convergence is desired

  • Smooth, well-conditioned optimization problems

Further Reading

  • Nocedal, J. & Wright, S. J. (2006). Numerical Optimization. Springer.

  • Fletcher, R. (2013). Practical Methods of Optimization. Wiley.

  • Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms. IMA Journal of Applied Mathematics.

  • Liu, D. C. & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming.