Newton's Method for Nonlinear Optimization

Introduction and Motivation

Newton’s Method - Overview

  • Purpose: Find stationary points of smooth functions

  • Approach: Use second-order Taylor approximation

  • Key Idea: Approximate function locally by quadratic and minimize exactly

  • Advantages:

    • Quadratic convergence near solution

    • Scale-invariant

    • Natural for optimization problems

  • Disadvantages:

    • Requires Hessian computation

    • May not converge globally

    • Expensive per iteration

Historical Context

  • Isaac Newton (1643-1727): Original method for root finding

  • Joseph Raphson (1648-1715): Improved formulation

  • Extension to Optimization:

    • Apply Newton-Raphson to \(\nabla f(x) = 0\)

    • Use second derivatives (Hessian matrix)

  • Modern Development:

    • Convergence analysis (20th century)

    • Globalization strategies

    • Computational considerations

Mathematical Foundation

Taylor Series Approximation

For a twice continuously differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\):

Second-Order Taylor Expansion

\[f(x + p) \approx f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x) p\]

Where:

  • \(\nabla f(x)\) is the gradient vector

  • \(\nabla^2 f(x)\) is the Hessian matrix

  • \(p\) is the step vector

Key Insight: Minimize the quadratic approximation to find the step direction.
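
To make the quadratic model concrete, here is a small numerical check (a sketch with an arbitrary example function, not one from the text) that the second-order expansion tracks \(f(x + p)\) with an error of order \(\|p\|^3\):

```python
import numpy as np

# Arbitrary smooth test function with hand-coded gradient and Hessian.
f = lambda x: x[0]**2 + np.exp(x[1]) + x[0] * x[1]
grad = lambda x: np.array([2 * x[0] + x[1], np.exp(x[1]) + x[0]])
hess = lambda x: np.array([[2.0, 1.0], [1.0, np.exp(x[1])]])

x = np.array([1.0, 0.5])
for t in (1e-1, 1e-2, 1e-3):
    p = t * np.array([1.0, -1.0])                       # shrinking step
    model = f(x) + grad(x) @ p + 0.5 * p @ hess(x) @ p  # quadratic model
    print(t, abs(f(x + p) - model))                     # error shrinks like t**3
```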

Deriving Newton’s Step

Let \(m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T \nabla^2 f(x_k) p\)

To minimize \(m_k(p)\), set \(\nabla_p m_k(p) = 0\):

\[\nabla_p m_k(p) = \nabla f(x_k) + \nabla^2 f(x_k) p = 0\]

Newton’s Step

\[p_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\]

Requirements:

  • \(\nabla^2 f(x_k)\) must be invertible

  • Typically assume \(\nabla^2 f(x_k) \succ 0\) (positive definite)

Newton’s Method Algorithm

  1. Input: Starting point \(x_0\), tolerance \(\epsilon > 0\)
  2. Set \(k = 0\)
  3. While \(\|\nabla f(x_k)\| > \epsilon\):
    • Compute gradient \(g_k = \nabla f(x_k)\)
    • Compute Hessian \(H_k = \nabla^2 f(x_k)\)
    • If \(H_k\) is not positive definite:
      • Apply modification (e.g., add \(\lambda I\))
    • Solve \(H_k p_k = -g_k\) for \(p_k\)
    • Set \(x_{k+1} = x_k + p_k\)
    • \(k = k + 1\)
  4. Output: \(x_k\) (approximate minimizer)
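
A compact rendering of this loop might look as follows; this is a sketch assuming the caller supplies `grad` and `hess` callables, and the initial shift `1e-3` used when the Hessian is not positive definite is an arbitrary choice:

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=100):
    """Pure Newton iteration with a simple diagonal shift when the
    Hessian is not positive definite (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:        # stop when the gradient is small
            break
        H = hess(x)
        lam = 0.0
        while True:                         # increase lam until H + lam*I > 0
            try:
                np.linalg.cholesky(H + lam * np.eye(x.size))
                break
            except np.linalg.LinAlgError:
                lam = 2.0 * lam if lam > 0 else 1e-3
        p = np.linalg.solve(H + lam * np.eye(x.size), -g)  # Newton system
        x = x + p                           # full Newton step (no line search)
    return x
```

Note that the linear system \(H_k p_k = -g_k\) is solved directly rather than forming the inverse Hessian explicitly.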

Convergence Analysis

Local Convergence Theory

Local Quadratic Convergence

Let \(f: \mathbb{R}^n \to \mathbb{R}\) be twice continuously differentiable with a Lipschitz continuous Hessian near \(x^*\). Suppose \(x^*\) is a local minimizer where:

  • \(\nabla f(x^*) = 0\)

  • \(\nabla^2 f(x^*)\) is positive definite

Then there exists \(\delta > 0\) such that if \(\|x_0 - x^*\| < \delta\), Newton’s method converges quadratically:

\[\|x_{k+1} - x^*\| \leq C \|x_k - x^*\|^2\]

for some constant \(C > 0\).

Interpretation: Near the solution, the error is roughly squared at each iteration!
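
As a quick illustration (a hypothetical one-dimensional example, not taken from the text): for \(f(x) = x \log x - x\), with \(f'(x) = \log x\), \(f''(x) = 1/x\), and minimizer \(x^* = 1\), the Newton update is \(x_{k+1} = x_k - x_k \log x_k\), and the error is roughly squared at each step:

```python
import math

x, x_star = 1.5, 1.0
for k in range(5):
    print(k, abs(x - x_star))   # error is roughly squared at each step
    x = x - x * math.log(x)     # Newton step for f(x) = x*log(x) - x
```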

Convergence Rate Comparison

[Figure: Convergence rate comparison]

Note: Newton’s method still converges superlinearly when the Lipschitz condition on the Hessian is relaxed to mere continuity.

Global Convergence Issues

Problems with Pure Newton’s Method:

  • Non-convex functions: May converge to saddle points or maxima

  • Indefinite Hessian: \(\nabla^2 f(x_k)\) not positive definite

  • Poor starting point: May diverge or converge slowly

  • Singular Hessian: \(\nabla^2 f(x_k)\) not invertible

Example: Convergence to a Non-Minimizer

Consider \(f(x) = x^3\). Starting from \(x_0 = 1\):

\[x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} = x_k - \frac{3x_k^2}{6x_k} = \frac{x_k}{2}\]

The iterates converge to \(x = 0\), where \(f'(0) = 0\) but also \(f''(0) = 0\): an inflection point, not a minimum!
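
A short numeric check (sketch) confirms that the limit point satisfies the first-order condition but fails the second-order sufficient condition:

```python
# Newton iteration for f(x) = x**3: x_{k+1} = x_k / 2, so x_k -> 0.
x = 1.0
for _ in range(20):
    x = x - (3 * x**2) / (6 * x)     # Newton step
print(x, 3 * x**2, 6 * x)            # x, f'(x), f''(x) are all (near) zero
```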

Practical Implementations

Handling Indefinite Hessians

Problem: \(\nabla^2 f(x_k)\) may not be positive definite

Solutions:

  1. Modified Newton Method:

    \[H_k + \lambda_k I \succ 0\]
    Choose \(\lambda_k\) to ensure positive definiteness
  2. Eigenvalue Modification:

    \[H_k = Q \Lambda Q^T \rightarrow Q |\Lambda| Q^T\]

    Replace negative eigenvalues with their absolute values

  3. Modified Cholesky Factorization: Add regularization to the diagonal during the factorization

Trade-off: Modifications may slow convergence but ensure descent direction.
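
One way to realize the eigenvalue modification above is sketched below; the indefinite test matrix and the small floor that keeps eigenvalues away from zero are arbitrary illustrative choices:

```python
import numpy as np

def eigenvalue_modification(H, floor=1e-8):
    """Replace negative (or tiny) eigenvalues of a symmetric H by their
    absolute values, bounded below by a small floor."""
    w, Q = np.linalg.eigh(H)                # H = Q diag(w) Q^T
    return Q @ np.diag(np.maximum(np.abs(w), floor)) @ Q.T

H = np.array([[-398.0, 0.0], [0.0, 200.0]])            # indefinite example
print(np.linalg.eigvalsh(eigenvalue_modification(H)))  # all eigenvalues positive
```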

Line Search Newton Method

Globalization Strategy: Combine Newton direction with line search

  1. Compute Newton step \(p_k = -H_k^{-1} g_k\)
  2. If \(p_k\) is not a descent direction:
    • Modify \(H_k\) or use steepest descent
  3. Choose step size \(\alpha_k\) via line search:
    • \(f(x_k + \alpha_k p_k) < f(x_k) + c_1 \alpha_k g_k^T p_k\)
  4. Set \(x_{k+1} = x_k + \alpha_k p_k\)
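
A minimal backtracking realization of steps 3 and 4 above (a sketch; the Armijo constant `c1 = 1e-4` and the halving factor are conventional but arbitrary choices):

```python
import numpy as np

def backtracking_newton_step(f, x, g, p, c1=1e-4):
    """Shrink alpha until the sufficient-decrease (Armijo) condition holds."""
    assert g @ p < 0, "p must be a descent direction"
    alpha = 1.0                 # always try the full Newton step first
    while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
        alpha *= 0.5            # backtrack
    return x + alpha * p
```

Trying \(\alpha = 1\) first matters: with the usual choice \(c_1 < 1/2\), the unit step is eventually accepted near the solution, which preserves quadratic convergence.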

Benefits:

  • Guarantees decrease in function value

  • Global convergence under mild conditions

  • Preserves quadratic convergence near solution

Computational Considerations

Cost per Iteration:

  • Gradient computation: \(O(n)\) (usually)

  • Hessian computation: \(O(n^2)\) storage, varies in time

  • Linear system solution: \(O(n^3)\) for dense systems, or roughly \(O(n^{1.5})\) for favorable sparsity patterns

Hessian Computation Methods:

  1. Analytical: Exact but may be complex to derive

  2. Finite Differences:

    \[\frac{\partial^2 f}{\partial x_i \partial x_j} \approx \frac{f(x + h e_i + h e_j) - f(x + h e_i) - f(x + h e_j) + f(x)}{h^2}\]
  3. Automatic Differentiation: Efficient and exact

Linear System Solution: Use Cholesky decomposition when \(H \succ 0\)
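
The finite-difference formula and the Cholesky-based solve translate directly into code; the sketch below uses SciPy's `cho_factor`/`cho_solve`, with an arbitrary step size `h` and test matrix:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def fd_hessian(f, x, h=1e-5):
    """Finite-difference Hessian approximation (O(n^2) function evaluations)."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (f(x + h*ei + h*ej) - f(x + h*ei)
                       - f(x + h*ej) + f(x)) / h**2
    return 0.5 * (H + H.T)          # symmetrize to reduce round-off noise

# Solve H p = -g with a Cholesky factorization (valid when H is positive definite).
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
p = cho_solve(cho_factor(H), -g)
print(fd_hessian(lambda x: 0.5 * x @ H @ x, np.zeros(2)))   # approximately H
print(H @ p + g)                                            # approximately 0
```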

Examples and Applications

Example 1: Quadratic Function

Consider \(f(x) = \frac{1}{2} x^T Q x - b^T x\) where \(Q \succ 0\).

  • \(\nabla f(x) = Qx - b\)

  • \(\nabla^2 f(x) = Q\)

Newton step: \(p_k = -Q^{-1}(Qx_k - b) = Q^{-1}b - x_k\)

Update: \(x_{k+1} = x_k + p_k = Q^{-1}b\)

Result

Newton’s method finds the exact solution in one iteration for quadratic functions!

This is also the source of fast local convergence in general: near a minimizer, a smooth function is well approximated by its quadratic model, which Newton’s method minimizes exactly.
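
A quick numerical confirmation of the one-step property (a sketch; `Q` and `b` are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)          # random positive definite matrix
b = rng.standard_normal(4)

x0 = rng.standard_normal(4)          # arbitrary starting point
g0 = Q @ x0 - b                      # gradient at x0
x1 = x0 - np.linalg.solve(Q, g0)     # one Newton step
print(np.allclose(x1, np.linalg.solve(Q, b)))   # True: exact minimizer
```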

Example 2: Rosenbrock Function

\[f(x_1, x_2) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2\]

Minimum at \((1, 1)\) with \(f(1, 1) = 0\).

Gradient:

\[\nabla f = \begin{pmatrix} -400x_1(x_2 - x_1^2) - 2(1 - x_1) \\ 200(x_2 - x_1^2) \end{pmatrix}\]

Hessian:

\[\nabla^2 f = \begin{pmatrix} -400(x_2 - x_1^2) + 800x_1^2 + 2 & -400x_1 \\ -400x_1 & 200 \end{pmatrix}\]

Challenge: Hessian can be indefinite far from solution, requiring modifications.
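
The gradient and Hessian above translate directly into code; evaluating the Hessian at the (arbitrarily chosen) point \((0, 1)\) shows the indefiniteness:

```python
import numpy as np

def rosen_grad(x):
    return np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                     200*(x[1] - x[0]**2)])

def rosen_hess(x):
    return np.array([[-400*(x[1] - x[0]**2) + 800*x[0]**2 + 2, -400*x[0]],
                     [-400*x[0], 200.0]])

print(np.linalg.eigvalsh(rosen_hess(np.array([0.0, 1.0]))))  # one negative eigenvalue
```

These callables can be passed to the `newton_minimize` sketch from the algorithm section, which then relies on its diagonal shift whenever an iterate lands in the indefinite region.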

Applications

Newton’s Method is Essential in:

  • Machine Learning:

    • Logistic regression (IRLS algorithm)

    • Neural network training (second-order methods)

    • Support vector machines

  • Statistics:

    • Maximum likelihood estimation

    • Generalized linear models

    • Bayesian inference (MAP estimation)

  • Engineering:

    • Structural optimization

    • Control system design

    • Signal processing

  • Economics and Finance:

    • Portfolio optimization

    • Option pricing models

    • Equilibrium analysis

Advantages and Limitations

Advantages of Newton’s Method

  • Quadratic Convergence: Extremely fast near solution

  • Scale Invariance: Performance independent of problem scaling

  • Affine Invariance: Unaffected by linear transformations

  • Natural for Optimization: Derived from optimization principles

  • Exact for Quadratics: Optimal for quadratic functions

  • Well-Understood Theory: Extensive convergence analysis

When to Use Newton’s Method

  • High accuracy required

  • Hessian computation is feasible

  • Problem is well-conditioned

  • Good starting point available

Limitations and Disadvantages

  • Computational Cost: \(O(n^3)\) per iteration

  • Memory Requirements: \(O(n^2)\) for Hessian storage

  • Hessian Computation: May be difficult or expensive

  • Global Convergence: No guarantee without modifications

  • Indefinite Hessian: Requires special handling

  • Singular Points: May fail at saddle points

When NOT to Use Newton’s Method

  • Large-scale problems (\(n > 10^4\))

  • Noisy or discontinuous functions

  • When only low accuracy is needed

  • Hessian is unavailable or expensive

Variants and Extensions

Important Variants

  1. Damped Newton Method:

    \[x_{k+1} = x_k - \alpha_k H_k^{-1} g_k\]

    Use line search to choose \(\alpha_k\)

  2. Trust Region Newton:

    \[\min_{p} \nabla f(x_k)^T p + \frac{1}{2} p^T \nabla^2 f(x_k) p\]
    subject to \(\|p\| \leq \Delta_k\)
  3. Quasi-Newton Methods: Approximate Hessian with updates (BFGS, L-BFGS)

  4. Gauss-Newton: For nonlinear least squares problems

  5. Levenberg-Marquardt: Combines Gauss-Newton with steepest descent
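
Several of these variants are available in standard libraries; for example, SciPy's `minimize` exposes a line-search Newton-CG method and trust-region Newton methods, and ships helpers for the Rosenbrock test function:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])
# Line-search (truncated) Newton:
res_ncg = minimize(rosen, x0, method="Newton-CG", jac=rosen_der, hess=rosen_hess)
# Trust-region Newton:
res_tr = minimize(rosen, x0, method="trust-ncg", jac=rosen_der, hess=rosen_hess)
print(res_ncg.x, res_tr.x)           # both should end up near (1, 1)
```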

Relationship to Other Methods

[Figure: Relationship of Newton’s method with other methods]

Newton’s method serves as the foundation for many advanced optimization algorithms.

Conclusion

Summary

Key Takeaways:

  • Newton’s method uses second-order information for fast convergence

  • Achieves quadratic convergence near the solution

  • Requires positive definite Hessian for guaranteed descent

  • Computational cost is \(O(n^3)\) per iteration

  • Global convergence requires modifications (line search, trust region)

  • Exact solution for quadratic functions in one step

  • Foundation for many modern optimization algorithms

Practical Advice

Use Newton’s method when high accuracy is needed and Hessian computation is feasible. For large problems, consider quasi-Newton alternatives.

Further Reading

Recommended Textbooks:

  • Nocedal & Wright: Numerical Optimization

  • Boyd & Vandenberghe: Convex Optimization

  • Bertsekas: Nonlinear Programming

Next Topics:

  • Quasi-Newton Methods (BFGS, L-BFGS)

  • Trust Region Methods

  • Constrained Optimization (KKT conditions)

  • Interior Point Methods