Gradient Descent Method

Motivation and Context

From Theory to Algorithm

Recall from Previous Lectures:

  • Necessary conditions for optimality (KKT conditions)

  • Sufficient conditions using Hessian matrix

  • Classification of critical points

Focus of this Topic:

  • How do we find these optimal points algorithmically?

  • Gradient descent: the foundational iterative method

Central Question

Given \(\min_{x \in \mathbb{R}^n} f(x)\) where \(f\) is smooth, how do we construct a sequence \(\{x_k\}\) that converges to a minimizer \(x^*\)?

Why Gradient Descent?

Historical Context:

  • Cauchy (1847): First systematic use of steepest descent

  • Foundation for modern machine learning algorithms

  • Still the workhorse for large-scale optimization

Key Applications:

  • Machine Learning: Training neural networks

  • Statistics: Maximum likelihood estimation

  • Engineering: Parameter identification

  • Finance: Portfolio optimization

Why Start Here?

Understanding gradient descent is crucial for appreciating more sophisticated methods like Newton’s method and quasi-Newton approaches.

The Gradient Descent Algorithm

Geometric Intuition

Key Insights:

  • Gradient \(\nabla f(x)\) is perpendicular to level curves

  • Points in direction of steepest increase

  • Negative gradient gives steepest decrease

(Figure: gradient descent method)

First-Order Taylor Expansion:

\[f(x + d) \approx f(x) + \nabla f(x)^T d\]

To minimize locally, choose \(d = -\alpha \nabla f(x)\)

Intuition: Following the Steepest Descent

  • At point \(x_k\), the gradient \(\nabla f(x_k)\) points in the direction of steepest ascent

  • To minimize \(f\), we move in the opposite direction: \(-\nabla f(x_k)\)

  • This is the direction of steepest descent

Definition: Gradient Descent

Starting from \(x_0 \in \mathbb{R}^n\), generate the sequence:

\[x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad k = 0, 1, 2, \ldots\]

where \(\alpha_k > 0\) is the step size (learning rate) at iteration \(k\).

The method is also called steepest descent when exact line search is used.

Algorithm Statement

  1. Require: function \( f: \mathbb{R}^n \to \mathbb{R} \), initial point \( x_0 \), tolerance \( \epsilon > 0 \)
  2. \( k \leftarrow 0 \)
  3. while \( \|\nabla f(x_k)\| > \epsilon \) do
    1. Choose step size \( \alpha_k > 0 \)
    2. \( x_{k+1} \leftarrow x_k - \alpha_k \nabla f(x_k) \)
    3. \( k \leftarrow k + 1 \)
  4. Return \( x_k \)

Key Components:

  • Direction: \(d_k = -\nabla f(x_k)\) (steepest descent direction)

  • Step size: \(\alpha_k\) (how far to move)

  • Stopping criterion: \(\|\nabla f(x_k)\| \leq \epsilon\)
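
A minimal NumPy sketch of this loop (the quadratic test problem at the end is an illustrative assumption, not part of the algorithm):

    import numpy as np

    def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-6, max_iter=10_000):
        """Fixed-step gradient descent: x_{k+1} = x_k - alpha * grad_f(x_k)."""
        x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) <= eps:   # stopping criterion: ||grad f(x_k)|| <= eps
                break
            x = x - alpha * g              # step along the steepest-descent direction
        return x, k

    # Illustrative test problem: f(x) = 1/2 x^T A x - b^T x, so grad f(x) = A x - b.
    A = np.array([[3.0, 0.0], [0.0, 1.0]])
    b = np.array([1.0, 1.0])
    x_opt, iters = gradient_descent(lambda x: A @ x - b, x0=np.zeros(2))
    print(iters, x_opt)                    # converges to A^{-1} b = [1/3, 1]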

Alternative Stopping Criteria

Gradient Norm: \(\|\nabla f(x_k)\| \leq \epsilon_g\)

  • Most common and theoretically justified

  • Measures proximity to stationary point

Function Value Change: \(|f(x_k) - f(x_{k-1})| \leq \epsilon_f\)

  • Practical when function evaluation is expensive

  • May terminate prematurely in flat regions

Step Size: \(\|x_k - x_{k-1}\| \leq \epsilon_x\)

  • Useful when stabilization of the iterates matters more than certified stationarity

  • Can be misleading with small step sizes

Combined Criteria: Often use multiple conditions simultaneously
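
One common convention is to stop as soon as any test fires; a sketch (the tolerances and the "any criterion" rule are choices, not prescriptions):

    def converged(g_norm, g0_norm, f_new, f_old, step_norm,
                  eps_g=1e-6, eps_f=1e-12, eps_x=1e-12):
        """Combined stopping test: relative gradient norm, relative
        function-value change, and step length."""
        return (g_norm <= eps_g * max(1.0, g0_norm)
                or abs(f_new - f_old) <= eps_f * max(1.0, abs(f_old))
                or step_norm <= eps_x)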

Step Size Selection

Critical Choice: Step Size \(\alpha_k\)

  • Fixed step size: \(\alpha_k = \alpha\) (constant)

  • Exact line search: \(\alpha_k = \arg\min_{\alpha > 0} f(x_k - \alpha \nabla f(x_k))\)

  • Inexact line search: Armijo rule, Wolfe conditions

  • Adaptive methods: AdaGrad, RMSprop, Adam (for ML)

Exact Line Search

At each iteration, solve the one-dimensional optimization problem:

\[\min_{\alpha > 0} \phi(\alpha), \qquad \phi(\alpha) := f(x_k - \alpha \nabla f(x_k))\]

Optimality condition: \(\phi'(\alpha_k) = 0\)

\[\Rightarrow \nabla f(x_{k+1})^T \nabla f(x_k) = 0\]
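
For a quadratic \(f(x) = \frac{1}{2} x^T A x - b^T x\) (see Example 1 below) this one-dimensional problem has a closed-form solution. With \(g_k = \nabla f(x_k) = A x_k - b\),

\[\phi'(\alpha) = -g_k^T \nabla f(x_k - \alpha g_k) = -\|g_k\|^2 + \alpha\, g_k^T A g_k = 0 \quad \Rightarrow \quad \alpha_k = \frac{g_k^T g_k}{g_k^T A g_k}\]

For general \(f\) no closed form exists, which is why inexact line searches dominate in practice.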

Step Size Selection: Trade-offs

Too Large Step Size:

  • May overshoot minimum

  • Can cause divergence

  • Oscillatory behavior

Too Small Step Size:

  • Very slow convergence

  • Many iterations needed

  • Computational inefficiency

(Figure: effect of the step size \(\alpha\))

Armijo Rule (Backtracking Line Search)

Idea: Start with a large step size and reduce until sufficient decrease condition is met.

Armijo Condition

Choose \(c_1 \in (0, 1)\) (typically \(c_1 = 10^{-4}\)). Accept \(\alpha_k\) if:

\[f(x_k - \alpha_k \nabla f(x_k)) \leq f(x_k) - c_1 \alpha_k \|\nabla f(x_k)\|^2\]

Backtracking Algorithm:

  1. Start with \(\alpha = \alpha_0\) (e.g., \(\alpha_0 = 1\))

  2. While Armijo condition fails: \(\alpha \leftarrow \rho \alpha\) (e.g., \(\rho = 0.5\))

  3. Accept current \(\alpha\) as \(\alpha_k\)

Advantages: Guarantees sufficient decrease, adaptive to local function behavior
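
A direct transcription of this loop for the steepest-descent direction (a sketch; the safeguard threshold is an illustrative choice):

    import numpy as np

    def backtracking(f, grad_f, x, alpha0=1.0, rho=0.5, c1=1e-4):
        """Armijo backtracking along d = -grad_f(x): shrink alpha until
        f(x - alpha*g) <= f(x) - c1*alpha*||g||^2."""
        g = grad_f(x)
        fx, gg = f(x), g @ g
        alpha = alpha0
        while f(x - alpha * g) > fx - c1 * alpha * gg:
            alpha *= rho              # sufficient decrease failed: reduce the step
            if alpha < 1e-16:         # safeguard: usually signals a wrong gradient
                raise RuntimeError("backtracking failed; check grad_f")
        return alpha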

Wolfe Conditions

More sophisticated line search conditions:

Strong Wolfe Conditions

Given \(0 < c_1 < c_2 < 1\), accept \(\alpha_k\) if both:

\[\begin{aligned} f(x_k + \alpha_k d_k) &\leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k \quad \text{(Armijo)}\\ |\nabla f(x_k + \alpha_k d_k)^T d_k| &\leq c_2 |\nabla f(x_k)^T d_k| \quad \text{(Curvature)} \end{aligned}\]

Typical values: \(c_1 = 10^{-4}\), \(c_2 = 0.9\) (for Newton-type methods) or \(c_2 = 0.1\) (for conjugate gradient)

Benefits:

  • Ensures both sufficient decrease and sufficient progress

  • Prevents steps that are too small

  • Theoretical guarantees for many optimization algorithms
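
Rather than coding a Wolfe search by hand, one can call an existing routine; for instance, SciPy's scipy.optimize.line_search searches for a step satisfying the strong Wolfe conditions. A sketch (the test function is an illustrative assumption; alpha comes back as None if the search fails):

    import numpy as np
    from scipy.optimize import line_search

    f = lambda x: x[0]**2 + 10 * x[1]**2              # illustrative smooth test function
    grad = lambda x: np.array([2 * x[0], 20 * x[1]])

    xk = np.array([1.0, 1.0])
    dk = -grad(xk)                                    # steepest-descent direction
    alpha, *_ = line_search(f, grad, xk, dk, c1=1e-4, c2=0.9)
    print(alpha)                                      # step satisfying the strong Wolfe conditions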

Convergence Analysis

Convergence Theory: Setup

Standard Assumptions:

  1. \(f: \mathbb{R}^n \to \mathbb{R}\) is continuously differentiable

  2. \(f\) is bounded below: \(f(x) \geq f_{\text{inf}} > -\infty\) for all \(x\)

  3. \(\nabla f\) is Lipschitz continuous with constant \(L > 0\):

    \[\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \forall x, y \in \mathbb{R}^n\]

Descent Property

If \(0 < \alpha_k < 2/L\), then gradient descent produces a monotone decreasing sequence:

\[f(x_{k+1}) < f(x_k) \quad \text{whenever } \nabla f(x_k) \neq 0\]

Proof idea: Use Taylor expansion and Lipschitz condition to bound \(f(x_{k+1})\).

Descent lemma: Key Technical Result

Descent lemma

If \(\nabla f\) is \(L\)-Lipschitz continuous, then for any \(x, y \in \mathbb{R}^n\):

\[f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2\]

Application to Gradient Descent: Setting \(y = x_k - \alpha_k \nabla f(x_k)\):

\[\begin{aligned} f(x_{k+1}) &\leq f(x_k) - \alpha_k \|\nabla f(x_k)\|^2 + \frac{L \alpha_k^2}{2} \|\nabla f(x_k)\|^2\\ &= f(x_k) - \alpha_k \left(1 - \frac{L \alpha_k}{2}\right) \|\nabla f(x_k)\|^2 \end{aligned}\]

If \(\alpha_k < 2/L\), then \(f(x_{k+1}) < f(x_k)\) (strict decrease).

Main Convergence Result

Theorem: Convergence of Gradient Descent

Under the Lipschitz assumption with constant \(L\), if we use fixed step size \(0 < \alpha \leq 1/L\), then:

\[\min_{0 \leq j \leq k} \|\nabla f(x_j)\|^2 \leq \frac{2(f(x_0) - f^*)}{(k+1)\alpha}\]
where \(f^* = \inf_x f(x)\).

Implications:

  • \(\min_{0 \leq j \leq k} \|\nabla f(x_j)\| \to 0\) as \(k \to \infty\); in fact, summability of \(\|\nabla f(x_j)\|^2\) forces \(\|\nabla f(x_k)\| \to 0\)

  • Convergence rate: \(O(1/k)\) (sublinear)

  • To achieve \(\|\nabla f(x_k)\| \leq \epsilon\), need \(O(1/\epsilon^2)\) iterations

Note: This is a worst-case bound; actual performance often much better.

Stronger Convergence for Convex Functions

Theorem: Convex Case

If \(f\) is convex and satisfies the Lipschitz condition, then with \(\alpha \leq 1/L\):

\[f(x_k) - f^* \leq \frac{\|x_0 - x^*\|^2}{2\alpha k}\]

Theorem: Strongly Convex Case

If \(f\) is \(\mu\)-strongly convex (\(\mu > 0\)) and \(L\)-smooth, then with \(\alpha = 2/(\mu + L)\):

\[\|x_k - x^*\|^2 \leq \left(\frac{L - \mu}{L + \mu}\right)^{2k} \|x_0 - x^*\|^2\]

This gives linear convergence with rate \(\rho = \frac{L - \mu}{L + \mu} < 1\).

Condition Number: \(\kappa = L/\mu\) determines convergence rate.

Understanding Condition Numbers

  • Definition: for a \(\mu\)-strongly convex, \(L\)-smooth function, the condition number is \(\kappa = L/\mu\)

  • Convergence rate: \(\rho = \frac{\kappa - 1}{\kappa + 1}\)

\(\kappa\)    \(\rho\)    Convergence Quality
1             0           Excellent (one step)
10            0.82        Good
100           0.98        Poor
1000          0.998       Very poor

Key Insight

Gradient descent struggles with ill-conditioned problems where \(\kappa \gg 1\). This motivates preconditioning and second-order methods.

Computational Aspects

Computational Complexity

Per iteration cost:

  • Gradient computation: often \(O(n)\), or proportional to the cost of one evaluation of \(f\) (e.g., via automatic differentiation)

  • Vector operations: \(O(n)\)

  • Line search (if used): depends on function evaluation cost

Memory requirements:

  • Current point: \(x_k \in \mathbb{R}^n\)

  • Current gradient: \(\nabla f(x_k) \in \mathbb{R}^n\)

  • Total: \(O(n)\) memory

Advantages:

  • Simple implementation

  • Low memory requirements

  • Guaranteed descent (with proper step size)

  • Scales well to high dimensions

Implementation Considerations

Numerical Issues:

  • Gradient computation accuracy

  • Finite difference approximations

  • Round-off errors in step size selection

Practical Tips:

  • Monitor both gradient norm and function value

  • Use relative tolerances: \(\|\nabla f(x_k)\| \leq \epsilon_{\text{rel}} \cdot \|\nabla f(x_0)\|\)

  • Restart if progress stalls

  • Consider scaling/preconditioning for better conditioning

Debugging:

  • Verify gradient with finite differences (see the sketch after this list)

  • Check descent property: \(f(x_{k+1}) < f(x_k)\)

  • Plot convergence history
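
The gradient check from the first bullet takes only a few lines with central differences; a sketch (the quadratic test at the end is illustrative):

    import numpy as np

    def check_gradient(f, grad_f, x, h=1e-6):
        """Relative error between grad_f and a central finite-difference gradient."""
        x = np.asarray(x, dtype=float)
        fd = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            fd[i] = (f(x + e) - f(x - e)) / (2 * h)   # central difference, O(h^2) accurate
        return np.linalg.norm(fd - grad_f(x)) / max(1.0, np.linalg.norm(fd))

    # Illustrative check on f(x) = 0.5*||x||^2, grad f(x) = x: error should be tiny.
    print(check_gradient(lambda x: 0.5 * x @ x, lambda x: x, np.array([1.0, -2.0, 3.0])))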

Limitations and Improvements

Limitations of Basic Gradient Descent

  1. Slow convergence: Only \(O(1/k)\) rate in general

  2. Condition number dependence: Poor performance when \(\kappa = L/\mu\) is large

  3. No curvature information: Ignores second-order information

  4. Sensitive to scaling: Performance depends on coordinate system

  5. Local minima: Can get stuck in non-convex problems

Example: Poorly Conditioned Problem

Consider \(f(x_1, x_2) = \frac{1}{2}(x_1^2 + 100x_2^2)\)

  • Condition number: \(\kappa = 100\)

  • Gradient descent oscillates in narrow valley

  • Convergence rate: \(\rho = \frac{99}{101} \approx 0.98\) (very slow!)
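
A five-iteration trace makes the zig-zag concrete (a sketch using the fixed step \(\alpha = 2/(\mu + L)\) from the strongly convex theorem):

    import numpy as np

    # f(x1, x2) = 0.5*(x1**2 + 100*x2**2): mu = 1, L = 100, kappa = 100
    grad = lambda x: np.array([x[0], 100.0 * x[1]])

    x = np.array([10.0, 1.0])
    alpha = 2.0 / (1.0 + 100.0)       # step 2/(mu + L)
    for k in range(5):
        x = x - alpha * grad(x)
        print(k, x)                   # x2 flips sign every step: zig-zag across the valley,
                                      # while both components contract only by ~0.98 per step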

Variants and Improvements

Momentum Methods:

  • Heavy ball method: \(x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})\) (sketched below)

  • Nesterov acceleration: Look-ahead gradient evaluation
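
A sketch of the heavy-ball update (the default \(\alpha\) and \(\beta\) are illustrative and problem-dependent):

    import numpy as np

    def heavy_ball(grad_f, x0, alpha=0.1, beta=0.9, eps=1e-8, max_iter=10_000):
        """Polyak's heavy ball: x_{k+1} = x_k - alpha*grad_f(x_k) + beta*(x_k - x_{k-1})."""
        x_prev = x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) <= eps:
                break
            # simultaneous update keeps the previous iterate for the momentum term
            x, x_prev = x - alpha * g + beta * (x - x_prev), x
        return x, k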

Adaptive Step Sizes:

  • AdaGrad: Accumulate squared gradients

  • RMSprop: Exponential moving average of squared gradients

  • Adam: Combine momentum and adaptive step sizes

Stochastic Variants:

  • Stochastic Gradient Descent (SGD)

  • Mini-batch methods

  • Variance reduction techniques (SVRG, SAGA)

Preview: Advanced Methods

Coming in next lectures:

  • Newton’s Method: Uses Hessian information

    \[x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\]

  • Quasi-Newton Methods: Approximate Hessian (BFGS, L-BFGS)

  • Conjugate Gradient: Exploits quadratic structure

  • Trust Region Methods: Adaptive step size selection

Key Insight

All these methods can be viewed as variants of the basic iteration:

\[x_{k+1} = x_k + \alpha_k d_k\]
where \(d_k\) is a search direction (not necessarily \(-\nabla f(x_k)\)).

Examples and Applications

Example 1: Quadratic Function

  • Consider \(f(x) = \frac{1}{2} x^T A x - b^T x\) where \(A \succ 0\).

  • Gradient: \(\nabla f(x) = Ax - b\)

  • Optimal point: \(x^* = A^{-1} b\)

  • Gradient descent:

    \[x_{k+1} = x_k - \alpha (Ax_k - b) = (I - \alpha A) x_k + \alpha b\]

  • Error analysis: \(e_k = x_k - x^*\) satisfies \(e_{k+1} = (I - \alpha A) e_k\)

  • Convergence: Linear with rate \(\rho = \max\{|1 - \alpha \lambda_{\min}|, |1 - \alpha \lambda_{\max}|\}\)

  • Optimal step size: \(\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}\) gives \(\rho^* = \frac{\kappa - 1}{\kappa + 1}\)
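
A numerical check of the optimal-step rate (the diagonal \(A\) is illustrative test data):

    import numpy as np

    A = np.diag([1.0, 10.0])                 # lambda_min = 1, lambda_max = 10, kappa = 10
    b = np.array([1.0, 1.0])
    x_star = np.linalg.solve(A, b)

    alpha = 2.0 / (1.0 + 10.0)               # optimal step 2/(lambda_min + lambda_max)
    rho = (10.0 - 1.0) / (10.0 + 1.0)        # predicted rate (kappa - 1)/(kappa + 1)

    x = np.zeros(2)
    for k in range(50):
        x = x - alpha * (A @ x - b)
    # Observed error matches the predicted rho**50 contraction of the initial error.
    print(np.linalg.norm(x - x_star), rho**50 * np.linalg.norm(x_star))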

Example 2: Logistic Regression

Problem: Given data \((x_i, y_i)_{i=1}^m\) where \(y_i \in \{-1, +1\}\), minimize:

\[f(\theta) = \frac{1}{m} \sum_{i=1}^m \log(1 + e^{-y_i x_i^T \theta}) + \frac{\lambda}{2} \|\theta\|^2\]

Gradient:

\[\nabla f(\theta) = -\frac{1}{m} \sum_{i=1}^m \frac{y_i x_i}{1 + e^{y_i x_i^T \theta}} + \lambda \theta\]

Properties:

  • \(f\) is convex; strongly convex when \(\lambda > 0\)

  • \(\nabla f\) is Lipschitz continuous

  • Gradient descent converges to global optimum

  • Widely used in machine learning
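
A runnable sketch on synthetic data (the data-generating rule, step size, and \(\lambda\) are illustrative assumptions):

    import numpy as np

    def logistic_grad(theta, X, y, lam):
        """Gradient of the regularized logistic loss above, labels y in {-1, +1}."""
        margins = y * (X @ theta)                    # y_i * x_i^T theta
        weights = -y / (1.0 + np.exp(margins))       # derivative of each log(1 + e^{-m}) term
        return X.T @ weights / len(y) + lam * theta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                    # synthetic features
    y = np.sign(X @ np.array([1.0, -2.0]))           # labels from a linear rule

    theta = np.zeros(2)
    for k in range(1000):
        theta -= 0.5 * logistic_grad(theta, X, y, lam=0.1)
    print(theta)                                     # roughly aligned with [1, -2], shrunk by the penalty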

Example 3: Least Squares with Regularization

Problem: \(\min_x \frac{1}{2}\|Ax - b\|^2 + \frac{\lambda}{2}\|x\|^2\)

Gradient: \(\nabla f(x) = A^T(Ax - b) + \lambda x = (A^TA + \lambda I)x - A^Tb\)

Analysis:

  • Condition number: \(\kappa = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}\), where \(\sigma_1 \geq \cdots \geq \sigma_n \geq 0\) are the singular values of \(A\)

  • Regularization improves conditioning

  • For \(\lambda > 0\), always strongly convex

Practical considerations:

  • Can solve directly: \(x^* = (A^TA + \lambda I)^{-1} A^Tb\)

  • Gradient descent useful for very large problems

  • Iterative methods can be more efficient than direct solve
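
A sketch comparing the two routes on random test data (the step \(1/L\) uses the constant \(L = \sigma_1^2 + \lambda\) from the analysis above):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(100, 20))
    b = rng.normal(size=100)
    lam = 1.0

    grad = lambda x: A.T @ (A @ x - b) + lam * x
    L = np.linalg.norm(A, 2) ** 2 + lam      # Lipschitz constant sigma_1^2 + lambda

    x = np.zeros(20)
    for k in range(2000):
        x = x - (1.0 / L) * grad(x)

    x_direct = np.linalg.solve(A.T @ A + lam * np.eye(20), A.T @ b)
    print(np.linalg.norm(x - x_direct))      # near zero: both reach the same minimizer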

Practical Guidelines

When to Use Gradient Descent

Good Cases:

  • Large-scale problems (high dimensional)

  • When Hessian is expensive or unavailable

  • Smooth, well-conditioned problems

  • As a warm-start for other methods

  • Online/streaming optimization

Poor Cases:

  • Ill-conditioned problems

  • When high accuracy is needed quickly

  • Small problems where direct methods work

  • Highly non-convex functions with many local minima

Alternatives to Consider:

  • Newton’s method (if Hessian available)

  • Quasi-Newton methods (L-BFGS)

  • Conjugate gradient (for quadratic problems)

Parameter Tuning Guidelines

Step Size Selection:

  • Start with \(\alpha = 1\) and use backtracking

  • For quadratic functions: estimate condition number and use \(\alpha = 2/(L + \mu)\)

  • Monitor convergence: reduce if oscillating, increase if too slow

  • Consider adaptive methods for complex problems

Stopping Criteria:

  • Primary: \(\|\nabla f(x_k)\| \leq \epsilon_{\text{abs}} + \epsilon_{\text{rel}} \|\nabla f(x_0)\|\)

  • Secondary: Maximum iterations (prevent infinite loops)

  • Optional: Function value change or step size criteria

Implementation Tips:

  • Scale variables to similar ranges

  • Use good initial guess when possible

  • Monitor and plot convergence history

Common Pitfalls and Debugging

Common Issues:

  • Divergence: Step size too large, check \(\alpha < 2/L\)

  • Slow convergence: Problem is ill-conditioned, consider preconditioning

  • Oscillation: Step size too large or poor scaling

  • Premature termination: Tolerance too loose or wrong stopping criterion

Debugging Strategies:

  • Verify gradient implementation with finite differences

  • Check that function decreases at each iteration

  • Plot function value and gradient norm vs iteration

  • Test on simple quadratic functions first

  • Use line search to automatically adjust step size

Performance Monitoring:

  • Track: \(f(x_k)\), \(\|\nabla f(x_k)\|\), \(\|x_k - x_{k-1}\|\)

  • Look for: monotone decrease and linear (geometric) convergence patterns

Summary and Outlook

Key Takeaways

Gradient Descent Fundamentals:

  • Simple, intuitive algorithm: move opposite to gradient direction

  • Guaranteed convergence under mild conditions

  • Sublinear convergence rate: \(O(1/k)\) in general case

Critical Factors:

  • Step size selection determines success/failure

  • Problem conditioning (\(\kappa = L/\mu\)) affects convergence speed

  • Proper scaling and initialization matter significantly

Practical Impact:

  • Foundation of modern machine learning optimization

  • Building block for more sophisticated methods

  • Still widely used for large-scale problems

Connection to Broader Optimization

Gradient Descent as Template:

\[x_{k+1} = x_k + \alpha_k d_k\]
Method               Direction \(d_k\)                                  Properties
Gradient Descent     \(-\nabla f(x_k)\)                                 Simple, robust
Newton’s Method      \(-[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)\)           Fast, quadratic convergence
Quasi-Newton         \(-B_k^{-1}\nabla f(x_k)\)                         Balances speed and cost
Conjugate Gradient   Conjugate directions                               Optimal for quadratics

Unifying Principle: All methods choose direction \(d_k\) and step size \(\alpha_k\) to ensure convergence while optimizing computational efficiency.

Looking Forward

Next Topics:

  • Newton’s Method: Second-order information for faster convergence

  • Quasi-Newton Methods: Approximating Hessian efficiently

  • Constrained Optimization: Handling equality/inequality constraints

  • Stochastic Methods: Large-scale and noisy optimization

Advanced Topics:

  • Non-convex optimization landscapes

  • Parallel and distributed optimization

  • Online and streaming optimization

  • Robust optimization under uncertainty

Remember

Understanding gradient descent deeply provides the foundation for mastering all advanced optimization methods.

Next lecture: Newton’s Method and Second-Order Optimization