Optimization · Lecture 4

Gradient Descent Method

Nonlinear Optimization

Dr. Mithun Mondal
SECTION 01

Motivation and Context

From Theory to Algorithm

Recall from Previous Lectures:

  • Necessary conditions for optimality (KKT conditions)

  • Sufficient conditions using Hessian matrix

  • Classification of critical points

Focus of this Topic:

  • How do we find these optimal points algorithmically?

  • Gradient descent: the foundational iterative method

Central Question

Given \(\min_{x \in \mathbb{R}^n} f(x)\) where \(f\) is smooth, how do we construct a sequence \(\{x_k\}\) that converges to \(x^*\)?

SECTION 02

Why Gradient Descent?

Historical Context:

  • Cauchy (1847): First systematic use of steepest descent

  • Foundation for modern machine learning algorithms

  • Still the workhorse for large-scale optimization

Key Applications:

  • Machine Learning: Training neural networks

  • Statistics: Maximum likelihood estimation

  • Engineering: Parameter identification

  • Finance: Portfolio optimization

Why Start Here?

Understanding gradient descent is crucial for appreciating more sophisticated methods like Newton’s method and quasi-Newton approaches.

SECTION 03

The Gradient Descent Algorithm

Geometric Intuition

Key Insights:

  • Gradient \(\nabla f(x)\) is perpendicular to level curves

  • Points in direction of steepest increase

  • Negative gradient gives steepest decrease

Gradient descent method

First-Order Taylor Expansion:

\[f(x + d) \approx f(x) + \nabla f(x)^T d\]

To decrease \(f\) locally (to first order), choose \(d = -\alpha \nabla f(x)\) with \(\alpha > 0\)

SECTION 04

Intuition: Following the Steepest Descent

  • At point \(x_k\), the gradient \(\nabla f(x_k)\) points in the direction of steepest ascent

  • To minimize \(f\), we move in the opposite direction: \(-\nabla f(x_k)\)

  • This is the direction of steepest descent

Definition: Gradient Descent

Starting from \(x_0 \in \mathbb{R}^n\), generate the sequence:

\[x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad k = 0, 1, 2, \ldots\]

where \(\alpha_k > 0\) is the step size (learning rate) at iteration \(k\).

The method is also called steepest descent when exact line search is used.

SECTION 05

Algorithm Statement

  1. Require: function \( f: \mathbb{R}^n \to \mathbb{R} \), initial point \( x_0 \), tolerance \( \epsilon > 0 \)
  2. \( k \leftarrow 0 \)
  3. While \( \|\nabla f(x_k)\| > \epsilon \) do:
     a. Choose step size \( \alpha_k > 0 \)
     b. \( x_{k+1} \leftarrow x_k - \alpha_k \nabla f(x_k) \)
     c. \( k \leftarrow k + 1 \)
  4. Return \( x_k \)

Key Components:

  • Direction: \(d_k = -\nabla f(x_k)\) (steepest descent direction)

  • Step size: \(\alpha_k\) (how far to move)

  • Stopping criterion: \(\|\nabla f(x_k)\| \leq \epsilon\)
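
The loop above translates almost line for line into code. Below is a minimal NumPy sketch with a fixed step size; the function names, default parameters, and the small test problem are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-6, max_iter=10_000):
    """Gradient descent with a fixed step size and a gradient-norm stopping rule."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:          # stopping criterion ||grad f(x_k)|| <= eps
            return x, k
        x = x - alpha * g                     # x_{k+1} = x_k - alpha_k * grad f(x_k)
    return x, max_iter

# Illustrative run on f(x) = 0.5*(x1^2 + 10*x2^2), whose gradient is (x1, 10*x2).
x_opt, iters = gradient_descent(lambda x: np.array([x[0], 10.0 * x[1]]),
                                x0=[5.0, 1.0], alpha=0.1)
print(x_opt, iters)
```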

SECTION 06

Alternative Stopping Criteria

Gradient Norm: \(\|\nabla f(x_k)\| \leq \epsilon_g\)

  • Most common and theoretically justified

  • Measures proximity to stationary point

Function Value Change: \(|f(x_k) - f(x_{k-1})| \leq \epsilon_f\)

  • Practical when function evaluation is expensive

  • May terminate prematurely in flat regions

Step Size: \(\|x_k - x_{k-1}\| \leq \epsilon_x\)

  • Useful when accuracy in the iterate \(x\) matters more than certifying near-stationarity

  • Can be misleading with small step sizes

Combined Criteria: Often use multiple conditions simultaneously
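
As a hedged illustration, one way to code a combined test is shown below; the tolerance names and the particular OR/AND combination are a reasonable choice, not the only one.

```python
import numpy as np

def should_stop(g, f_new, f_old, x_new, x_old,
                eps_g=1e-6, eps_f=1e-10, eps_x=1e-10):
    """Combined stopping test for a descent loop."""
    grad_small = np.linalg.norm(g) <= eps_g                 # near-stationarity
    f_stalled  = abs(f_new - f_old) <= eps_f                # function value change
    x_stalled  = np.linalg.norm(x_new - x_old) <= eps_x     # step size
    # Stop on a small gradient, or when both progress measures have stalled.
    return grad_small or (f_stalled and x_stalled)
```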

SECTION 07

Step Size Selection

Critical Choice: Step Size \(\alpha_k\)
  • Fixed step size: \(\alpha_k = \alpha\) (constant)

  • Exact line search: \(\alpha_k = \arg\min_{\alpha > 0} f(x_k - \alpha \nabla f(x_k))\)

  • Inexact line search: Armijo rule, Wolfe conditions

  • Adaptive methods: AdaGrad, RMSprop, Adam (for ML)

Exact Line Search

At each iteration, solve the one-dimensional optimization problem:

\[\min_{\alpha > 0} \phi(\alpha) = f(x_k - \alpha \nabla f(x_k))\]

Optimality condition: \(\phi'(\alpha_k) = 0\)

\[\Rightarrow \nabla f(x_{k+1})^T \nabla f(x_k) = 0\]
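
For the quadratic \(f(x) = \frac{1}{2}x^T A x - b^T x\), this one-dimensional problem has a closed-form solution, \(\alpha_k = \frac{g_k^T g_k}{g_k^T A g_k}\) with \(g_k = \nabla f(x_k)\). A small sketch follows; the test matrix and right-hand side are illustrative.

```python
import numpy as np

def exact_step_quadratic(A, g):
    """Exact line-search step along -g for f(x) = 0.5 x^T A x - b^T x."""
    return (g @ g) / (g @ (A @ g))

def steepest_descent_quadratic(A, b, x0, eps=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = A @ x - b                        # gradient of the quadratic
        if np.linalg.norm(g) <= eps:
            break
        x = x - exact_step_quadratic(A, g) * g
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])       # symmetric positive definite
b = np.array([1.0, 1.0])
print(steepest_descent_quadratic(A, b, [0.0, 0.0]), np.linalg.solve(A, b))
```
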
SECTION 08

Step Size Selection: Trade-offs

Too Large Step Size:

  • May overshoot minimum

  • Can cause divergence

  • Oscillatory behavior

Too Small Step Size:

  • Very slow convergence

  • Many iterations needed

  • Computational inefficiency

[Figure: effect of the step size \(\alpha\) on the iterates]

SECTION 09

Armijo Rule (Backtracking Line Search)

Idea: Start with a large step size and reduce it until the sufficient-decrease condition is met.

Armijo Condition

Choose \(c_1 \in (0, 1)\) (typically \(c_1 = 10^{-4}\)). Accept \(\alpha_k\) if:

\[f(x_k - \alpha_k \nabla f(x_k)) \leq f(x_k) - c_1 \alpha_k \|\nabla f(x_k)\|^2\]

Backtracking Algorithm:

  1. Start with \(\alpha = \alpha_0\) (e.g., \(\alpha_0 = 1\))

  2. While Armijo condition fails: \(\alpha \leftarrow \rho \alpha\) (e.g., \(\rho = 0.5\))

  3. Accept current \(\alpha\) as \(\alpha_k\)

Advantages: Guarantees sufficient decrease, adaptive to local function behavior
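
A compact sketch of backtracking for the steepest-descent direction, with parameter defaults following the typical values above; the cap on the number of halvings is an added safeguard, not part of the slide.

```python
import numpy as np

def backtracking(f, grad_f, x, alpha0=1.0, rho=0.5, c1=1e-4, max_halvings=50):
    """Armijo backtracking line search for the direction d = -grad f(x)."""
    g = grad_f(x)
    gg = g @ g                                    # ||grad f(x)||^2
    fx = f(x)
    alpha = alpha0
    for _ in range(max_halvings):
        # Armijo sufficient-decrease condition for d = -g
        if f(x - alpha * g) <= fx - c1 * alpha * gg:
            return alpha
        alpha *= rho                              # shrink the step and retry
    return alpha
```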

SECTION 10

Wolfe Conditions

More sophisticated line search conditions:

Strong Wolfe Conditions

Given \(0 < c_1 < c_2 < 1\), accept \(\alpha_k\) if both:

\[\begin{aligned} f(x_k + \alpha_k d_k) &\leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k \quad \text{(Armijo)}\\ |\nabla f(x_k + \alpha_k d_k)^T d_k| &\leq c_2 |\nabla f(x_k)^T d_k| \quad \text{(Curvature)} \end{aligned}\]

Typical values: \(c_1 = 10^{-4}\), \(c_2 = 0.9\) (for Newton-type methods) or \(c_2 = 0.1\) (for conjugate gradient)

Benefits:

  • Ensures both sufficient decrease and sufficient progress

  • Prevents steps that are too small

  • Theoretical guarantees for many optimization algorithms
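
A full strong-Wolfe line search needs a bracketing/zoom procedure, which is beyond this slide. The sketch below only checks whether a given trial step satisfies both conditions for a descent direction \(d\); the helper name is an illustrative assumption.

```python
import numpy as np

def satisfies_strong_wolfe(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """True if step `alpha` along descent direction `d` meets both strong Wolfe conditions."""
    g0_d = grad_f(x) @ d                        # directional derivative at x (negative for descent)
    armijo    = f(x + alpha * d) <= f(x) + c1 * alpha * g0_d
    curvature = abs(grad_f(x + alpha * d) @ d) <= c2 * abs(g0_d)
    return armijo and curvature
```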

SECTION 11

Convergence Analysis

Convergence Theory: Setup

Standard Assumptions:

  1. \(f: \mathbb{R}^n \to \mathbb{R}\) is continuously differentiable

  2. \(f\) is bounded below: \(f(x) \geq f_{\text{inf}} > -\infty\) for all \(x\)

  3. \(\nabla f\) is Lipschitz continuous with constant \(L > 0\):

    \[\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \forall x, y \in \mathbb{R}^n\]

Descent Property

If \(0 < \alpha_k < 2/L\), then gradient descent produces a monotone decreasing sequence:

\[f(x_{k+1}) < f(x_k) \quad \text{whenever } \nabla f(x_k) \neq 0\]

Proof idea: Use Taylor expansion and Lipschitz condition to bound \(f(x_{k+1})\).

SECTION 12

Descent lemma: Key Technical Result

Descent lemma

If \(\nabla f\) is \(L\)-Lipschitz continuous, then for any \(x, y \in \mathbb{R}^n\):

\[f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2\]

Application to Gradient Descent: Setting \(y = x_k - \alpha_k \nabla f(x_k)\):

\[\begin{aligned} f(x_{k+1}) &\leq f(x_k) - \alpha_k \|\nabla f(x_k)\|^2 + \frac{L \alpha_k^2}{2} \|\nabla f(x_k)\|^2\\ &= f(x_k) - \alpha_k \left(1 - \frac{L \alpha_k}{2}\right) \|\nabla f(x_k)\|^2 \end{aligned}\]

If \(\alpha_k < 2/L\), then \(f(x_{k+1}) < f(x_k)\) (strict decrease).

SECTION 13

Main Convergence Result

Theorem: Convergence of Gradient Descent

Under the Lipschitz assumption with constant \(L\), if we use fixed step size \(0 < \alpha \leq 1/L\), then:

\[\min_{0 \leq j \leq k} \|\nabla f(x_j)\|^2 \leq \frac{2(f(x_0) - f^*)}{(k+1)\alpha}\]
where \(f^* = \inf_x f(x)\).

Implications:

  • \(\min_{j \le k} \|\nabla f(x_j)\| \to 0\); in fact \(\sum_k \|\nabla f(x_k)\|^2 < \infty\), so \(\|\nabla f(x_k)\| \to 0\)

  • Convergence rate: \(O(1/k)\) (sublinear)

  • To achieve \(\|\nabla f(x_k)\| \leq \epsilon\), need \(O(1/\epsilon^2)\) iterations

Note: This is a worst-case bound; actual performance often much better.

SECTION 14

Stronger Convergence for Convex Functions

Theorem: Convex Case

If \(f\) is convex and satisfies the Lipschitz condition, then with \(\alpha \leq 1/L\):

\[f(x_k) - f^* \leq \frac{\|x_0 - x^*\|^2}{2\alpha k}\]

Theorem: Strongly Convex Case

If \(f\) is \(\mu\)-strongly convex (\(\mu > 0\)) and \(L\)-smooth, then with \(\alpha = 2/(\mu + L)\):

\[\|x_k - x^*\|^2 \leq \left(\frac{L - \mu}{L + \mu}\right)^{2k} \|x_0 - x^*\|^2\]

This gives linear convergence with rate \(\rho = \frac{L - \mu}{L + \mu} < 1\).

Condition Number: \(\kappa = L/\mu\) determines convergence rate.

SECTION 15

Understanding Condition Numbers

  • For a \(\mu\)-strongly convex, \(L\)-smooth function, the condition number is \(\kappa = L/\mu\)

  • Convergence rate: \(\rho = \frac{\kappa - 1}{\kappa + 1}\)

\(\kappa\) | \(\rho\) | Convergence Quality
1 | 0 | Excellent (one step)
10 | 0.82 | Good
100 | 0.98 | Poor
1000 | 0.998 | Very poor

Key Insight

Gradient descent struggles with ill-conditioned problems where \(\kappa \gg 1\). This motivates preconditioning and second-order methods.

SECTION 16

Computational Aspects

Computational Complexity

Per iteration cost:

  • Gradient computation: problem dependent (often \(O(n)\) for simple analytic functions)

  • Vector operations: \(O(n)\)

  • Line search (if used): depends on function evaluation cost

Memory requirements:

  • Current point: \(x_k \in \mathbb{R}^n\)

  • Current gradient: \(\nabla f(x_k) \in \mathbb{R}^n\)

  • Total: \(O(n)\) memory

Advantages:

  • Simple implementation

  • Low memory requirements

  • Guaranteed descent (with proper step size)

  • Scales well to high dimensions

SECTION 17

Implementation Considerations

Numerical Issues:

  • Gradient computation accuracy

  • Finite difference approximations

  • Round-off errors in step size selection

Practical Tips:

  • Monitor both gradient norm and function value

  • Use relative tolerances: \(\|\nabla f(x_k)\| \leq \epsilon_{\text{rel}} \cdot \|\nabla f(x_0)\|\)

  • Restart if progress stalls

  • Consider scaling/preconditioning for better conditioning

Debugging:

  • Verify gradient with finite differences

  • Check descent property: \(f(x_{k+1}) < f(x_k)\)

  • Plot convergence history
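
For the first debugging item, a central-difference gradient check takes only a few lines; the step size \(h\) and the error threshold mentioned in the comment are illustrative choices.

```python
import numpy as np

def check_gradient(f, grad_f, x, h=1e-6):
    """Compare an analytic gradient against central finite differences at x."""
    x = np.asarray(x, dtype=float)
    fd = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        fd[i] = (f(x + e) - f(x - e)) / (2.0 * h)      # central difference
    err = np.linalg.norm(fd - grad_f(x)) / max(1.0, np.linalg.norm(fd))
    return err   # a correct gradient typically gives a relative error well below 1e-6

# Illustrative check on f(x) = 0.5*||x||^2, whose gradient is x.
print(check_gradient(lambda x: 0.5 * x @ x, lambda x: x, np.array([1.0, -2.0, 3.0])))
```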

SECTION 18

Limitations and Improvements

Limitations of Basic Gradient Descent
  1. Slow convergence: Only \(O(1/k)\) rate in general

  2. Condition number dependence: Poor performance when \(\kappa = L/\mu\) is large

  3. No curvature information: Ignores second-order information

  4. Sensitive to scaling: Performance depends on coordinate system

  5. Local minima: Can get stuck in non-convex problems

Example: Poorly Conditioned Problem

Consider \(f(x_1, x_2) = \frac{1}{2}(x_1^2 + 100x_2^2)\)

  • Condition number: \(\kappa = 100\)

  • Gradient descent oscillates in narrow valley

  • Convergence rate: \(\rho = \frac{99}{101} \approx 0.98\) (very slow!)

SECTION 19

Variants and Improvements

Momentum Methods:

  • Heavy ball method: \(x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})\) (see the sketch after this list)

  • Nesterov acceleration: Look-ahead gradient evaluation

Adaptive Step Sizes:

  • AdaGrad: Accumulate squared gradients

  • RMSprop: Exponential moving average of squared gradients

  • Adam: Combine momentum and adaptive step sizes

Stochastic Variants:

  • Stochastic Gradient Descent (SGD)

  • Mini-batch methods

  • Variance reduction techniques (SVRG, SAGA)
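
A hedged sketch of the heavy-ball update from the momentum bullet above. The tuned values in the example use the standard quadratic tuning \(\alpha = 4/(\sqrt{L}+\sqrt{\mu})^2\), \(\beta = \big((\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})\big)^2\), which is an added assumption rather than part of the slide.

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha, beta, eps=1e-6, max_iter=10_000):
    """Heavy-ball iteration: x_{k+1} = x_k - alpha*grad f(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            return x, k
        x, x_prev = x - alpha * g + beta * (x - x_prev), x
    return x, max_iter

# On f = 0.5*(x1^2 + 100*x2^2) (L = 100, mu = 1), the tuned momentum method
# needs far fewer iterations than plain gradient descent on the same problem.
grad = lambda x: np.array([x[0], 100.0 * x[1]])
x, k = heavy_ball(grad, [1.0, 1.0], alpha=4.0 / 121.0, beta=(9.0 / 11.0) ** 2)
print(x, k)
```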

SECTION 20

Preview: Advanced Methods

Coming in next lectures:

  • Newton’s Method: Uses Hessian information

    \[x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\]
  • Quasi-Newton Methods: Approximate Hessian (BFGS, L-BFGS)

  • Conjugate Gradient: Exploits quadratic structure

  • Trust Region Methods: Adaptive step size selection

Key Insight

All these methods can be viewed as variants of the basic iteration:

\[x_{k+1} = x_k + \alpha_k d_k\]
where \(d_k\) is a search direction (not necessarily \(-\nabla f(x_k)\)).
SECTION 21

Examples and Applications

Example 1: Quadratic Function
  • Consider \(f(x) = \frac{1}{2} x^T A x - b^T x\) where \(A \succ 0\).

  • Gradient: \(\nabla f(x) = Ax - b\)

  • Optimal point: \(x^* = A^{-1} b\)

  • Gradient descent:

    \[x_{k+1} = x_k - \alpha (Ax_k - b) = (I - \alpha A) x_k + \alpha b\]
  • Error analysis: \(e_k = x_k - x^*\) satisfies \(e_{k+1} = (I - \alpha A) e_k\)

  • Convergence: Linear with rate \(\rho = \max\{|1 - \alpha \lambda_{\min}|, |1 - \alpha \lambda_{\max}|\}\)

  • Optimal step size: \(\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}\) gives \(\rho^* = \frac{\kappa - 1}{\kappa + 1}\)
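
A short check of the last two bullets: running the iteration with \(\alpha^*\) on a small SPD matrix and comparing the observed per-step error contraction with \((\kappa - 1)/(\kappa + 1)\). The matrix and right-hand side are illustrative.

```python
import numpy as np

A = np.diag([1.0, 4.0, 10.0])                # SPD, lambda_min = 1, lambda_max = 10, kappa = 10
b = np.array([1.0, 1.0, 1.0])
x_star = np.linalg.solve(A, b)

alpha = 2.0 / (1.0 + 10.0)                   # alpha* = 2 / (lambda_min + lambda_max)
rho_theory = (10.0 - 1.0) / (10.0 + 1.0)     # (kappa - 1) / (kappa + 1) ~ 0.818

x = np.zeros(3)
errors = []
for _ in range(30):
    x = x - alpha * (A @ x - b)              # x_{k+1} = (I - alpha*A) x_k + alpha*b
    errors.append(np.linalg.norm(x - x_star))

print(rho_theory, errors[-1] / errors[-2])   # the observed ratio approaches rho_theory
```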

SECTION 22

Example 2: Logistic Regression

Problem: Given data \((x_i, y_i)_{i=1}^m\) where \(y_i \in \{-1, +1\}\), minimize:

\[f(\theta) = \frac{1}{m} \sum_{i=1}^m \log(1 + e^{-y_i x_i^T \theta}) + \frac{\lambda}{2} \|\theta\|^2\]

Gradient:

\[\nabla f(\theta) = -\frac{1}{m} \sum_{i=1}^m \frac{y_i x_i}{1 + e^{y_i x_i^T \theta}} + \lambda \theta\]

Properties:

  • \(f\) is convex; strongly convex when \(\lambda > 0\)

  • \(\nabla f\) is Lipschitz continuous

  • Gradient descent converges to global optimum

  • Widely used in machine learning
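
A minimal sketch of these formulas in code, using the \(\pm 1\) label convention from the problem statement; the synthetic data, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def logreg_loss_grad(theta, X, y, lam=0.1):
    """Loss and gradient of l2-regularized logistic regression, labels y in {-1, +1}."""
    m = X.shape[0]
    margins = y * (X @ theta)                                  # y_i * x_i^T theta
    loss = np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * (theta @ theta)
    # grad = -(1/m) sum_i y_i x_i / (1 + exp(y_i x_i^T theta)) + lam * theta
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / m + lam * theta
    return loss, grad

# Illustrative run on synthetic data with a fixed step size.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
theta = np.zeros(3)
for _ in range(500):
    _, g = logreg_loss_grad(theta, X, y)
    theta -= 0.5 * g
print(theta)
```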

SECTION 23

Example 3: Least Squares with Regularization

Problem: \(\min_x \frac{1}{2}\|Ax - b\|^2 + \frac{\lambda}{2}\|x\|^2\)

Gradient: \(\nabla f(x) = A^T(Ax - b) + \lambda x = (A^TA + \lambda I)x - A^Tb\)

Analysis:

  • Condition number: \(\kappa = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}\) where \(\sigma_1 \geq \cdots \geq \sigma_n \geq 0\)

  • Regularization improves conditioning

  • For \(\lambda > 0\), always strongly convex

Practical considerations:

  • Can solve directly: \(x^* = (A^TA + \lambda I)^{-1} A^Tb\)

  • Gradient descent useful for very large problems

  • Iterative methods can be more efficient than direct solve
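
A hedged sketch comparing the iterative and direct routes on a small random problem; the problem size, \(\lambda\), and the step rule \(\alpha = 1/L\) are illustrative choices.

```python
import numpy as np

def ridge_gd(A, b, lam=1.0, eps=1e-8, max_iter=10_000):
    """Gradient descent for 0.5*||Ax - b||^2 + 0.5*lam*||x||^2 with a fixed safe step size."""
    n = A.shape[1]
    L = np.linalg.norm(A, 2) ** 2 + lam          # Lipschitz constant of the gradient
    alpha = 1.0 / L
    x = np.zeros(n)
    for _ in range(max_iter):
        g = A.T @ (A @ x - b) + lam * x
        if np.linalg.norm(g) <= eps:
            break
        x = x - alpha * g
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)
x_gd = ridge_gd(A, b, lam=1.0)
x_direct = np.linalg.solve(A.T @ A + np.eye(5), A.T @ b)   # closed-form solution
print(np.linalg.norm(x_gd - x_direct))                     # should be essentially zero
```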

SECTION 24

Practical Guidelines

When to Use Gradient Descent

Good Cases:

  • Large-scale problems (high dimensional)

  • When Hessian is expensive or unavailable

  • Smooth, well-conditioned problems

  • As a warm-start for other methods

  • Online/streaming optimization

Poor Cases:

  • Ill-conditioned problems

  • When high accuracy is needed quickly

  • Small problems where direct methods work

  • Highly non-convex functions with many local minima

Alternatives to Consider:

  • Newton’s method (if Hessian available)

  • Quasi-Newton methods (L-BFGS)

  • Conjugate gradient (for quadratic problems)

SECTION 25

Parameter Tuning Guidelines

Step Size Selection:

  • Start with \(\alpha = 1\) and use backtracking

  • For quadratic functions: estimate condition number and use \(\alpha = 2/(L + \mu)\)

  • Monitor convergence: reduce if oscillating, increase if too slow

  • Consider adaptive methods for complex problems

Stopping Criteria:

  • Primary: \(\|\nabla f(x_k)\| \leq \epsilon_{abs} + \epsilon_{rel} \|\nabla f(x_0)\|\)

  • Secondary: Maximum iterations (prevent infinite loops)

  • Optional: Function value change or step size criteria

Implementation Tips:

  • Scale variables to similar ranges

  • Use good initial guess when possible

  • Monitor and plot convergence history

SECTION 26

Common Pitfalls and Debugging

Common Issues:

  • Divergence: Step size too large, check \(\alpha < 2/L\)

  • Slow convergence: Problem is ill-conditioned, consider preconditioning

  • Oscillation: Step size too large or poor scaling

  • Premature termination: Tolerance too loose or wrong stopping criterion

Debugging Strategies:

  • Verify gradient implementation with finite differences

  • Check that function decreases at each iteration

  • Plot function value and gradient norm vs iteration

  • Test on simple quadratic functions first

  • Use line search to automatically adjust step size

Performance Monitoring:

  • Track: \(f(x_k)\), \(\|\nabla f(x_k)\|\), \(\|x_k - x_{k-1}\|\)

  • Look for: monotonic decrease, exponential convergence patterns

SECTION 27

Summary and Outlook

Key Takeaways

Gradient Descent Fundamentals:

  • Simple, intuitive algorithm: move opposite to gradient direction

  • Guaranteed convergence under mild conditions

  • Sublinear convergence rate: \(O(1/k)\) in general case

Critical Factors:

  • Step size selection determines success/failure

  • Problem conditioning (\(\kappa = L/\mu\)) affects convergence speed

  • Proper scaling and initialization matter significantly

Practical Impact:

  • Foundation of modern machine learning optimization

  • Building block for more sophisticated methods

  • Still widely used for large-scale problems

SECTION 28

Connection to Broader Optimization

Gradient Descent as Template:

\[x_{k+1} = x_k + \alpha_k d_k\]
Method | Direction \(d_k\) | Properties
Gradient Descent | \(-\nabla f(x_k)\) | Simple, robust
Newton’s Method | \(-[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)\) | Fast, quadratic convergence
Quasi-Newton | \(-B_k^{-1}\nabla f(x_k)\) | Balances speed and cost
Conjugate Gradient | Conjugate directions | Optimal for quadratics

Unifying Principle: All methods choose direction \(d_k\) and step size \(\alpha_k\) to ensure convergence while optimizing computational efficiency.

SECTION 29

Looking Forward

Next Topics:

  • Newton’s Method: Second-order information for faster convergence

  • Quasi-Newton Methods: Approximating Hessian efficiently

  • Constrained Optimization: Handling equality/inequality constraints

  • Stochastic Methods: Large-scale and noisy optimization

Advanced Topics:

  • Non-convex optimization landscapes

  • Parallel and distributed optimization

  • Online and streaming optimization

  • Robust optimization under uncertainty

Remember

Understanding gradient descent deeply provides the foundation for mastering all advanced optimization methods.

Next lecture: Newton’s Method and Second-Order Optimization