Introduction
Introduction to Quasi-Newton Methods
- Family of optimization algorithms for nonlinear problems
- Alternative to Newton’s method when the Hessian is expensive to compute
- Approximate the Hessian (or its inverse) using gradient information
- Key idea: build curvature information from the observed behavior of the objective function
- Particularly useful for medium-sized problems (100-1000 variables)
Motivation
Why Quasi-Newton Methods?
- Newton’s method: fast convergence but expensive Hessian computation
- Gradient descent: cheap per iteration but slow convergence
- Quasi-Newton: balance between speed and computational cost
Key Insight
Use gradient information from multiple iterations to build an approximation of the Hessian matrix without computing second derivatives explicitly.
Comparison with Other Methods
| Method | Requires Hessian? | Storage | Convergence |
| --- | --- | --- | --- |
| Newton’s | Yes | \(O(n^2)\) | Quadratic |
| Gradient Descent | No | \(O(n)\) | Linear |
| Quasi-Newton | No | \(O(n^2)\) | Superlinear |
- Quasi-Newton methods offer a balance between computational cost and convergence rate
- Avoid explicit calculation of second derivatives
- Maintain superlinear convergence (better than gradient descent)
Mathematical Foundation
The Secant Equation
Secant Condition
For a good Hessian approximation \(B_{k+1}\), we require the secant equation:
\[B_{k+1} s_k = y_k, \qquad s_k = x_{k+1} - x_k, \quad y_k = \nabla f_{k+1} - \nabla f_k\]
- This mimics the exact relationship \(\bar{G}_k s_k = y_k\), where \(\bar{G}_k = \int_0^1 \nabla^2 f(x_k + t s_k)\,dt\) is the average Hessian along the step
- Ensures the approximation captures local curvature
- All quasi-Newton methods satisfy this condition (see the numerical check below)
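For a quadratic \(f(x) = \tfrac{1}{2}x^T A x - b^T x\), the average Hessian is simply \(A\), so \(y_k = A s_k\) holds exactly. A minimal numerical check (purely illustrative; the matrix and points are random):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)          # symmetric positive definite Hessian
b = rng.standard_normal(n)

grad = lambda x: A @ x - b           # gradient of f(x) = 0.5 x^T A x - b^T x

x_k, x_next = rng.standard_normal(n), rng.standard_normal(n)
s_k = x_next - x_k
y_k = grad(x_next) - grad(x_k)

# For a quadratic, y_k = A s_k, so any B satisfying the secant
# equation B s_k = y_k reproduces the true Hessian along s_k.
print(np.allclose(y_k, A @ s_k))     # True
```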
Positive Definiteness
Curvature Condition
For convergence, we need \(s_k^T y_k > 0\)
- Ensures that \(B_k\) remains positive definite
- Guarantees a descent direction: \(p_k^T \nabla f_k < 0\)
- Automatically satisfied for strictly convex functions
- May require damping for non-convex problems
Wolfe Conditions
Line search must satisfy:
\[\begin{aligned}
f(x_k + \alpha_k p_k) &\le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k && \text{(sufficient decrease)} \\
\nabla f(x_k + \alpha_k p_k)^T p_k &\ge c_2 \nabla f_k^T p_k && \text{(curvature)}
\end{aligned}\]
with \(0 < c_1 < c_2 < 1\) (typically \(c_1 = 10^{-4}\), \(c_2 = 0.9\)). The curvature condition guarantees \(s_k^T y_k > 0\).
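A minimal sketch of checking both conditions at a trial step length (a hypothetical helper, not from the text; `c1` and `c2` follow the typical values above):

```python
import numpy as np

def wolfe_satisfied(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for a trial step size alpha."""
    fx, gx = f(x), grad(x)
    x_new = x + alpha * p
    slope = gx @ p                                  # directional derivative, must be < 0
    sufficient_decrease = f(x_new) <= fx + c1 * alpha * slope
    curvature = grad(x_new) @ p >= c2 * slope
    return sufficient_decrease and curvature

# Example on the simple quadratic f(x) = 0.5 * ||x||^2
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, -2.0])
p = -grad(x)                                        # steepest descent direction
print(wolfe_satisfied(f, grad, x, p, alpha=1.0))    # True: exact minimizer along p
```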
BFGS Method
Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method
Most popular quasi-Newton algorithm:
- Named after its four independent discoverers (1970)
- Approximates the inverse Hessian \(H_k \approx [\nabla^2 f(x_k)]^{-1}\)
- Update formula (a NumPy sketch follows below):
\[H_{k+1} = (I - \rho_k s_k y_k^T)H_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T\]
where
\[\begin{aligned} s_k &= x_{k+1} - x_k \\ y_k &= \nabla f_{k+1} - \nabla f_k \\ \rho_k &= \frac{1}{y_k^T s_k} \end{aligned}\]
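A direct NumPy transcription of this update (a sketch; `bfgs_update` is an illustrative name, and a production implementation would expand the product so each update costs \(O(n^2)\) rather than the \(O(n^3)\) matrix-matrix multiplications used here):

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H."""
    rho = 1.0 / (y @ s)                 # assumes the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```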
BFGS Algorithm
Algorithm: Quasi-Newton Method (BFGS)
- Choose initial point \( x_0 \) and \( H_0 = I \)
- For \( k = 0, 1, 2, \ldots \):
  - Compute search direction: \( p_k = -H_k \nabla f_k \)
  - Perform a line search to find \( \alpha_k \)
  - Update position: \( x_{k+1} = x_k + \alpha_k p_k \)
  - Compute \( s_k = x_{k+1} - x_k \) and \( y_k = \nabla f_{k+1} - \nabla f_k \)
  - If \( y_k^T s_k > 0 \):
    - Compute \( \rho_k = \frac{1}{y_k^T s_k} \)
    - Update \( H_{k+1} \) using the BFGS formula (otherwise skip the update, \( H_{k+1} = H_k \))
  - Check convergence: if \( \| \nabla f_{k+1} \| < \epsilon \), stop
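Putting the steps together, a self-contained sketch with a simple backtracking (Armijo) line search standing in for a full Wolfe search (all names and tolerances are illustrative):

```python
import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-8, max_iter=200):
    """Minimal BFGS sketch with backtracking line search."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                                   # H_0 = I
    g = grad(x)
    for _ in range(max_iter):
        p = -H @ g                                  # search direction
        # Backtracking line search (Armijo sufficient decrease only)
        alpha, c1 = 1.0, 1e-4
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-10:                           # curvature condition: keep H positive definite
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)  # BFGS update
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:                 # convergence check
            break
    return x

# Example: minimize the 2-D Rosenbrock function
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(bfgs_minimize(f, grad, [-1.2, 1.0]))          # approx. [1. 1.]
```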
BFGS Properties
Theoretical Properties
- Maintains positive definiteness if \(H_0\) is positive definite and the curvature condition \(y_k^T s_k > 0\) holds
- Satisfies the secant equation: \(H_{k+1} y_k = s_k\)
- Superlinear convergence for convex problems
- Self-correcting: errors in \(H_k\) are eventually corrected
Practical Advantages
- Robust and reliable
- Good performance on a wide range of problems
- Handles ill-conditioning better than gradient descent
- Limited-memory variant (L-BFGS) available for large problems
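In practice BFGS is rarely hand-rolled; it is available in standard libraries. A minimal usage sketch with SciPy, assuming SciPy is installed (`rosen` and `rosen_der` are SciPy's built-in Rosenbrock test function and its gradient):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0, 0.5, 0.8])
result = minimize(rosen, x0, method='BFGS', jac=rosen_der)
print(result.x, result.nit)   # solution (all ones) and iteration count
```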
DFP Method
Davidon-Fletcher-Powell (DFP) Method
- The first quasi-Newton method (proposed by Davidon in 1959, published by Fletcher and Powell in 1963)
- Approximates the Hessian \(B_k \approx \nabla^2 f(x_k)\) rather than its inverse
- Update formula (a NumPy sketch follows below):
\[B_{k+1} = (I - \rho_k y_k s_k^T) B_k (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}\]
- Dual to BFGS: interchanging \(s_k \leftrightarrow y_k\) and \(B_k \leftrightarrow H_k\) turns the DFP update into the BFGS update
- Less used today than BFGS, but historically important
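A minimal NumPy sketch of the DFP update (`dfp_update` is an illustrative name; note that with a direct Hessian approximation the search direction requires solving \(B_k p_k = -\nabla f_k\)):

```python
import numpy as np

def dfp_update(B, s, y):
    """One DFP update of the Hessian approximation B (assumes y^T s > 0)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(y, s)
    return V @ B @ V.T + rho * np.outer(y, y)
```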
DFP vs BFGS
BFGS generally outperforms DFP in practice, especially for inexact line searches
Other Quasi-Newton Methods
Broyden’s Method
- Uses a rank-one update instead of a rank-two update
- Update formula (a code sketch follows below):
\[B_{k+1} = B_k + \frac{(y_k - B_k s_k)s_k^T}{s_k^T s_k}\]
- Simpler, but the update is not symmetric and may lose positive definiteness
- Mainly used for solving systems of nonlinear equations
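A short sketch of Broyden’s method applied to a nonlinear system \(F(x) = 0\) (the helper name and the test system are illustrative):

```python
import numpy as np

def broyden_solve(F, x0, tol=1e-10, max_iter=100):
    """Broyden's method for F(x) = 0, starting from B_0 = I."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                       # Jacobian approximation
    Fx = F(x)
    for _ in range(max_iter):
        s = np.linalg.solve(B, -Fx)          # step: B s = -F(x)
        x = x + s
        F_new = F(x)
        y = F_new - Fx
        # Rank-one Broyden update: B += (y - B s) s^T / (s^T s)
        B += np.outer(y - B @ s, s) / (s @ s)
        Fx = F_new
        if np.linalg.norm(Fx) < tol:
            break
    return x

# Example: solve x0^2 + x1^2 = 4 and x0 * x1 = 1
F = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[0] * x[1] - 1.0])
print(broyden_solve(F, [2.0, 0.5]))          # approx. [1.932, 0.518]
```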
Limited Memory BFGS (L-BFGS)
Motivation
BFGS requires \(O(n^2)\) storage, prohibitive for large \(n\)
- Stores only the last \(m\) pairs \((s_i, y_i)\) instead of the full matrix
- Typical choice: \(m = 3\) to \(20\)
- Storage requirement: \(O(mn)\) instead of \(O(n^2)\)
- Particularly effective for large-scale optimization
- Used in machine learning and deep learning applications
Two-Loop Recursion
Efficient algorithm to compute \(H_k \nabla f_k\) without storing \(H_k\) explicitly
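A sketch of the two-loop recursion, assuming the stored pairs are kept in Python lists ordered from oldest to newest (names are illustrative):

```python
import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Two-loop recursion: returns p_k = -H_k * grad_k without forming H_k."""
    q = grad_k.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # Initial Hessian approximation H_k^0 = gamma * I (scaling from newest pair)
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1]) if s_list else 1.0
    r = gamma * q
    # Second loop: oldest pair to newest
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * (y @ r)
        r += (a - beta) * s
    return -r
```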
Update Formulas
Rank-One and Rank-Two Updates
General Update Formula
Most quasi-Newton methods use updates of the form:
-
Rank-one update: Adds a matrix of rank 1 (e.g., outer product of vectors)
\[B_{k+1} = B_k + \alpha v v^T\] -
Rank-two update: Adds a matrix of rank 2 (e.g., sum of two outer products)
\[B_{k+1} = B_k + \alpha v v^T + \beta w w^T\] -
Both BFGS and DFP use rank-two updates
-
Rank-one updates exist but may not preserve positive definiteness
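A small numerical illustration of the rank of each correction term (values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
v, w = rng.standard_normal(n), rng.standard_normal(n)
alpha, beta = 2.0, -1.5

rank_one = alpha * np.outer(v, v)
rank_two = alpha * np.outer(v, v) + beta * np.outer(w, w)

print(np.linalg.matrix_rank(rank_one))   # 1
print(np.linalg.matrix_rank(rank_two))   # 2 for generic v, w
```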
Convergence Analysis
Convergence Properties
Convergence Rate
- Superlinear convergence: \(\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0\)
- For quadratic functions (with exact line search): finite termination in at most \(n\) steps (see the check below)
- Better than the linear convergence of gradient descent
- Slower than the quadratic convergence of Newton’s method
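A quick numerical check of finite termination on a random strongly convex quadratic, using the BFGS update with exact line search (the test problem is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # SPD Hessian of f(x) = 0.5 x^T A x - b^T x
b = rng.standard_normal(n)

x = np.zeros(n)
H = np.eye(n)
g = A @ x - b
for k in range(n):
    p = -H @ g
    alpha = -(g @ p) / (p @ A @ p)     # exact line search for a quadratic
    s = alpha * p
    x = x + s
    g_new = A @ x - b
    y = g_new - g
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)
    g = g_new
    print(k + 1, np.linalg.norm(g))    # gradient norm near machine precision by step n
```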
Convergence Conditions
- Objective function must be twice continuously differentiable
- Hessian must be positive definite at the optimum
- Line search must satisfy the Wolfe conditions
- Starting point must be sufficiently close to the optimum
Practical Considerations
Implementation Details
Line Search Strategy
- Strong Wolfe conditions ensure \(s_k^T y_k > 0\)
- Backtracking line search is commonly used
- Exact line search is not necessary (and often expensive)
Initialization
- Usually start with \(H_0 = I\) or \(H_0 = \gamma I\)
- Can use scaling: \(\gamma = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}\) (see the sketch below)
- Good initialization can improve early convergence
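A sketch of this scaling heuristic (a hypothetical helper; `s` and `y` are the most recent step and gradient-difference vectors):

```python
import numpy as np

def scaled_initial_inverse_hessian(s, y):
    """Heuristic H_0 = gamma * I with gamma = (s^T y) / (y^T y)."""
    gamma = (s @ y) / (y @ y)
    return gamma * np.eye(len(s))
```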
Common Issues and Solutions
Numerical Issues
- Loss of positive definiteness: skip the update or use damping
- Ill-conditioning: regular restarts or modified updates
- Stagnation: check for degenerate cases
Performance Tips
- Use L-BFGS for large problems (\(n > 1000\))
- Consider preconditioning for ill-conditioned problems
- Monitor the condition number of the Hessian approximation
- Implement efficient matrix operations
Comparison
Comparison with Newton’s Method
- Newton’s Method:
  - Requires the exact Hessian \(\nabla^2 f(x_k)\)
  - Needs \(O(n^3)\) operations per iteration (solving the Newton system)
  - The exact Hessian may not be positive definite away from a minimum
  - Quadratic convergence near the optimum
- Quasi-Newton Methods:
  - Build an approximation using only gradient information
  - Maintain positive definiteness (provided the curvature condition holds)
  - \(O(n^2)\) storage for the Hessian approximation
  - Superlinear but not quadratic convergence
Comparison with Gradient Descent
- Gradient Descent:
  - Only uses first-order information
  - \(O(n)\) storage
  - Linear convergence rate
  - May be very slow for ill-conditioned problems
- Quasi-Newton Methods:
  - Incorporate approximate second-order information
  - Faster convergence (superlinear)
  - Better handling of ill-conditioned problems
  - Higher memory requirements
Applications
Applications and Use Cases
Typical Applications
- Parameter estimation in statistical models
- Machine learning optimization (especially L-BFGS)
- Engineering design optimization
- Portfolio optimization in finance
- Nonlinear least-squares problems
When to Choose Quasi-Newton
- Gradient is available but the Hessian is expensive
- Problem size: medium to large (100-10,000 variables)
- Objective function is smooth and well-behaved
- Superlinear convergence is desired
Conclusion
Summary
- Quasi-Newton methods provide efficient optimization for medium-scale problems
- BFGS is the most popular variant, with good theoretical properties
- DFP is historically important but less used today
- L-BFGS extends applicability to large-scale problems
- Rank-two updates maintain positive definiteness and curvature information
- Offer a practical compromise between Newton’s method and gradient descent
When to Use
- Problems with expensive Hessian evaluation
- Problems where \(O(n^2)\) storage is acceptable (or use L-BFGS)
- When superlinear convergence is desired
- Smooth, well-conditioned optimization problems
Further Reading
- Nocedal, J. & Wright, S. J. (2006). Numerical Optimization. Springer.
- Fletcher, R. (2013). Practical Methods of Optimization. Wiley.
- Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms. IMA Journal of Applied Mathematics.
- Liu, D. C. & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming.