Motivation and Context
From Theory to Algorithm
Recall from Previous Lectures:
- Necessary conditions for optimality (KKT conditions)
- Sufficient conditions using Hessian matrix
- Classification of critical points
Focus of this Topic:
- How do we find these optimal points algorithmically?
- Gradient descent: the foundational iterative method
Central Question
Given \(\min_{x \in \mathbb{R}^n} f(x)\) where \(f\) is smooth, how do we construct a sequence \(\{x_k\}\) that converges to \(x^*\)?
Why Gradient Descent?
Historical Context:
- Cauchy (1847): First systematic use of steepest descent
- Foundation for modern machine learning algorithms
- Still the workhorse for large-scale optimization
Key Applications:
- Machine Learning: Training neural networks
- Statistics: Maximum likelihood estimation
- Engineering: Parameter identification
- Finance: Portfolio optimization
Why Start Here?
Understanding gradient descent is crucial for appreciating more sophisticated methods like Newton’s method and quasi-Newton approaches.
The Gradient Descent Algorithm
Geometric Intuition
Key Insights:
- Gradient \(\nabla f(x)\) is perpendicular to level curves
- It points in the direction of steepest increase
- The negative gradient gives the direction of steepest decrease
First-Order Taylor Expansion:
\[f(x + d) \approx f(x) + \nabla f(x)^T d\]
To minimize locally, choose \(d = -\alpha \nabla f(x)\) with \(\alpha > 0\)
Intuition: Following the Steepest Descent
- At point \(x_k\), the gradient \(\nabla f(x_k)\) points in the direction of steepest ascent
- To minimize \(f\), we move in the opposite direction: \(-\nabla f(x_k)\)
- This is the direction of steepest descent
Definition: Gradient Descent
Starting from \(x_0 \in \mathbb{R}^n\), generate the sequence:
\[x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad k = 0, 1, 2, \ldots\]
where \(\alpha_k > 0\) is the step size (learning rate) at iteration \(k\).
The method is also called steepest descent when exact line search is used.
Algorithm Statement
- Require: function \( f: \mathbb{R}^n \to \mathbb{R} \), initial point \( x_0 \), tolerance \( \epsilon > 0 \)
- \( k \leftarrow 0 \)
- while \( \|\nabla f(x_k)\| > \epsilon \) do
  - Choose step size \( \alpha_k > 0 \)
  - \( x_{k+1} \leftarrow x_k - \alpha_k \nabla f(x_k) \)
  - \( k \leftarrow k + 1 \)
- return \( x_k \)
Key Components:
- Direction: \(d_k = -\nabla f(x_k)\) (steepest descent direction)
- Step size: \(\alpha_k\) (how far to move)
- Stopping criterion: \(\|\nabla f(x_k)\| \leq \epsilon\)
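A minimal NumPy sketch of the fixed-step variant follows; the function name `gradient_descent` and its default parameters are illustrative choices, not part of the lecture.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-6, max_iter=10_000):
    """Fixed-step gradient descent; returns the final iterate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion: ||grad f(x_k)|| <= eps
            break
        x = x - alpha * g              # x_{k+1} = x_k - alpha * grad f(x_k)
    return x

# Usage on f(x) = 1/2 (x_1^2 + 100 x_2^2); alpha must satisfy alpha < 2/L = 0.02
x_min = gradient_descent(lambda x: np.array([x[0], 100.0 * x[1]]),
                         x0=[1.0, 1.0], alpha=0.01)
```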
Alternative Stopping Criteria
Gradient Norm: \(\|\nabla f(x_k)\| \leq \epsilon_g\)
- Most common and theoretically justified
- Measures proximity to a stationary point
Function Value Change: \(|f(x_k) - f(x_{k-1})| \leq \epsilon_f\)
- Practical when function evaluation is expensive
- May terminate prematurely in flat regions
Step Size: \(\|x_k - x_{k-1}\| \leq \epsilon_x\)
- Useful when stability of the iterates matters more than certified optimality
- Can be misleading with small step sizes
Combined Criteria: Often use multiple conditions simultaneously
Step Size Selection
Critical Choice: Step Size \(\alpha_k\)
- Fixed step size: \(\alpha_k = \alpha\) (constant)
- Exact line search: \(\alpha_k = \arg\min_{\alpha > 0} f(x_k - \alpha \nabla f(x_k))\)
- Inexact line search: Armijo rule, Wolfe conditions
- Adaptive methods: AdaGrad, RMSprop, Adam (for ML)
Exact Line Search
At each iteration, solve the one-dimensional optimization problem:
\[\alpha_k = \arg\min_{\alpha > 0} \phi(\alpha), \qquad \phi(\alpha) = f(x_k - \alpha \nabla f(x_k))\]
Optimality condition: \(\phi'(\alpha_k) = 0\)
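As a concrete instance (a standard computation, not stated on the slide), the one-dimensional problem has a closed-form solution for the quadratic \(f(x) = \frac{1}{2} x^T A x - b^T x\) with gradient \(g_k = A x_k - b\):

```latex
% phi(alpha) = f(x_k - alpha g_k); differentiate and set to zero:
\phi'(\alpha) = -g_k^T \bigl( A(x_k - \alpha g_k) - b \bigr)
             = -\|g_k\|^2 + \alpha \, g_k^T A g_k = 0
\quad\Longrightarrow\quad
\alpha_k = \frac{g_k^T g_k}{g_k^T A g_k}.
```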
Step Size Selection: Trade-offs
Too Large Step Size:
- May overshoot minimum
- Can cause divergence
- Oscillatory behavior
Too Small Step Size:
- Very slow convergence
- Many iterations needed
- Computational inefficiency
Armijo Rule (Backtracking Line Search)
Idea: Start with a large step size and reduce until sufficient decrease condition is met.
Armijo Condition
Choose \(c_1 \in (0, 1)\) (typically \(c_1 = 10^{-4}\)). Accept \(\alpha_k\) if:
\[f(x_k - \alpha_k \nabla f(x_k)) \leq f(x_k) - c_1 \alpha_k \|\nabla f(x_k)\|^2\]
Backtracking Algorithm:
- Start with \(\alpha = \alpha_0\) (e.g., \(\alpha_0 = 1\))
- While the Armijo condition fails: \(\alpha \leftarrow \rho \alpha\) (e.g., \(\rho = 0.5\))
- Accept current \(\alpha\) as \(\alpha_k\)
Advantages: Guarantees sufficient decrease, adaptive to local function behavior
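A sketch of backtracking for the steepest-descent direction, assuming the Armijo condition above; the helper name `backtracking` and the `max_halvings` safeguard are illustrative.

```python
import numpy as np

def backtracking(f, grad_f, x, alpha0=1.0, c1=1e-4, rho=0.5, max_halvings=50):
    """Shrink alpha until f(x - alpha g) <= f(x) - c1 * alpha * ||g||^2."""
    g = grad_f(x)
    fx, gg = f(x), g @ g
    alpha = alpha0
    for _ in range(max_halvings):      # safeguard against infinite loops
        if f(x - alpha * g) <= fx - c1 * alpha * gg:
            break                      # Armijo condition satisfied
        alpha *= rho                   # backtrack: alpha <- rho * alpha
    return alpha
```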
Wolfe Conditions
More sophisticated line search conditions:
Strong Wolfe Conditions
Given \(0 < c_1 < c_2 < 1\) and search direction \(d_k = -\nabla f(x_k)\), accept \(\alpha_k\) if both:
\[f(x_k + \alpha_k d_k) \leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k \quad \text{(sufficient decrease)}\]
\[|\nabla f(x_k + \alpha_k d_k)^T d_k| \leq c_2 |\nabla f(x_k)^T d_k| \quad \text{(curvature condition)}\]
Typical values: \(c_1 = 10^{-4}\), \(c_2 = 0.9\) (for Newton-type methods) or \(c_2 = 0.1\) (for conjugate gradient)
Benefits:
- Ensures both sufficient decrease and sufficient progress
- Prevents steps that are too small
- Theoretical guarantees for many optimization algorithms
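In practice one rarely hand-codes a Wolfe search; `scipy.optimize.line_search` implements a strong-Wolfe line search, as in this sketch (the quadratic test function is illustrative):

```python
import numpy as np
from scipy.optimize import line_search   # strong-Wolfe line search

f = lambda x: 0.5 * (x[0]**2 + 100.0 * x[1]**2)
grad = lambda x: np.array([x[0], 100.0 * x[1]])

x = np.array([1.0, 1.0])
d = -grad(x)                              # steepest-descent direction
alpha, *_ = line_search(f, grad, x, d, c1=1e-4, c2=0.9)
if alpha is not None:                     # the search can fail; handle that case
    x = x + alpha * d
```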
Convergence Analysis
Convergence Theory: Setup
Standard Assumptions:
- \(f: \mathbb{R}^n \to \mathbb{R}\) is continuously differentiable
- \(f\) is bounded below: \(f(x) \geq f_{\text{inf}} > -\infty\) for all \(x\)
- \(\nabla f\) is Lipschitz continuous with constant \(L > 0\):
\[\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \forall x, y \in \mathbb{R}^n\]
Descent Property
If \(0 < \alpha_k < 2/L\), then gradient descent produces a monotonically decreasing sequence:
\[f(x_{k+1}) < f(x_k) \quad \text{whenever } \nabla f(x_k) \neq 0\]
Proof idea: Use Taylor expansion and Lipschitz condition to bound \(f(x_{k+1})\).
Descent lemma: Key Technical Result
Descent lemma
If \(\nabla f\) is \(L\)-Lipschitz continuous, then for any \(x, y \in \mathbb{R}^n\):
\[f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2\]
Application to Gradient Descent: Setting \(x = x_k\) and \(y = x_{k+1} = x_k - \alpha_k \nabla f(x_k)\):
\[f(x_{k+1}) \leq f(x_k) - \alpha_k \left(1 - \frac{L \alpha_k}{2}\right) \|\nabla f(x_k)\|^2\]
If \(\alpha_k < 2/L\), then \(f(x_{k+1}) < f(x_k)\) (strict decrease).
Main Convergence Result
Theorem: Convergence of Gradient Descent
Under the Lipschitz assumption with constant \(L\), if we use fixed step size \(0 < \alpha \leq 1/L\), then:
\[\min_{0 \leq i \leq k} \|\nabla f(x_i)\|^2 \leq \frac{2 \left( f(x_0) - f_{\text{inf}} \right)}{\alpha (k+1)}\]
Implications:
- \(\min_{i \leq k} \|\nabla f(x_i)\| \to 0\) as \(k \to \infty\) (subsequence convergence)
- Convergence rate: \(O(1/k)\) for the squared gradient norm (sublinear)
- To achieve \(\|\nabla f(x_k)\| \leq \epsilon\), need \(O(1/\epsilon^2)\) iterations
Note: This is a worst-case bound; actual performance often much better.
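The bound follows from a standard telescoping argument (a sketch, not from the slides): the descent lemma with \(\alpha \leq 1/L\) gives a per-step decrease of at least \((\alpha/2)\|\nabla f(x_i)\|^2\), and summing over iterations yields

```latex
\frac{\alpha}{2} \sum_{i=0}^{k} \|\nabla f(x_i)\|^2
  \;\leq\; f(x_0) - f(x_{k+1})
  \;\leq\; f(x_0) - f_{\text{inf}}
\quad\Longrightarrow\quad
\min_{0 \leq i \leq k} \|\nabla f(x_i)\|^2
  \;\leq\; \frac{2\,(f(x_0) - f_{\text{inf}})}{\alpha\,(k+1)}.
```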
Stronger Convergence for Convex Functions
Theorem: Convex Case
If \(f\) is convex and satisfies the Lipschitz condition, then with \(\alpha \leq 1/L\):
\[f(x_k) - f(x^*) \leq \frac{\|x_0 - x^*\|^2}{2 \alpha k}\]
Theorem: Strongly Convex Case
If \(f\) is \(\mu\)-strongly convex (\(\mu > 0\)) and \(L\)-smooth, then with \(\alpha = 2/(\mu + L)\):
\[\|x_k - x^*\| \leq \left( \frac{L - \mu}{L + \mu} \right)^k \|x_0 - x^*\|\]
This gives linear convergence with rate \(\rho = \frac{L - \mu}{L + \mu} < 1\).
Condition Number: \(\kappa = L/\mu\) determines convergence rate.
Understanding Condition Numbers
- Theorem: For a strongly convex function, condition number \(\kappa = L/\mu\)
- Convergence rate: \(\rho = \frac{\kappa - 1}{\kappa + 1}\)

| \(\kappa\) | \(\rho\) | Convergence Quality |
|---|---|---|
| 1 | 0 | Excellent (one step) |
| 10 | 0.82 | Good |
| 100 | 0.98 | Poor |
| 1000 | 0.998 | Very poor |
Key Insight
Gradient descent struggles with ill-conditioned problems where \(\kappa \gg 1\). This motivates preconditioning and second-order methods.
Computational Aspects
Computational Complexity
Per iteration cost:
- Gradient computation: \(O(n)\) for general functions
- Vector operations: \(O(n)\)
- Line search (if used): depends on function evaluation cost
Memory requirements:
- Current point: \(x_k \in \mathbb{R}^n\)
- Current gradient: \(\nabla f(x_k) \in \mathbb{R}^n\)
- Total: \(O(n)\) memory
Advantages:
- Simple implementation
- Low memory requirements
- Guaranteed descent (with proper step size)
- Scales well to high dimensions
Implementation Considerations
Numerical Issues:
- Gradient computation accuracy
- Finite difference approximations
- Round-off errors in step size selection
Practical Tips:
- Monitor both gradient norm and function value
- Use relative tolerances: \(\|\nabla f(x_k)\| \leq \epsilon_{\text{rel}} \cdot \|\nabla f(x_0)\|\)
- Restart if progress stalls
- Consider scaling/preconditioning for better conditioning
Debugging:
- Verify gradient with finite differences
- Check descent property: \(f(x_{k+1}) < f(x_k)\)
- Plot convergence history
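A minimal sketch of the finite-difference gradient check; the helper name `check_gradient` and the step `h` are illustrative choices.

```python
import numpy as np

def check_gradient(f, grad_f, x, h=1e-6):
    """Max abs difference between grad_f and a central finite-difference gradient."""
    x = np.asarray(x, dtype=float)
    g_fd = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g_fd[i] = (f(x + e) - f(x - e)) / (2.0 * h)   # central difference
    return np.max(np.abs(g_fd - grad_f(x)))

# For f(x) = 1/2 ||x||^2 the error should be far below 1e-6
err = check_gradient(lambda x: 0.5 * x @ x, lambda x: x,
                     np.array([1.0, -2.0, 3.0]))
```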
Limitations and Improvements
Limitations of Basic Gradient Descent
- Slow convergence: Only \(O(1/k)\) rate in general
- Condition number dependence: Poor performance when \(\kappa = L/\mu\) is large
- No curvature information: Ignores second-order information
- Sensitive to scaling: Performance depends on coordinate system
- Local minima: Can get stuck in non-convex problems
Example: Poorly Conditioned Problem
Consider \(f(x_1, x_2) = \frac{1}{2}(x_1^2 + 100x_2^2)\)
- Condition number: \(\kappa = 100\)
- Gradient descent oscillates in narrow valley
- Convergence rate: \(\rho = \frac{99}{101} \approx 0.98\) (very slow!)
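A short numerical sketch of this behavior (the starting point and iteration count are illustrative): even with the optimal fixed step, the error contracts only by \(99/101\) per iteration.

```python
import numpy as np

# f(x) = 1/2 (x_1^2 + 100 x_2^2): mu = 1, L = 100, kappa = 100
grad = lambda x: np.array([x[0], 100.0 * x[1]])
alpha = 2.0 / (1.0 + 100.0)            # optimal fixed step 2/(mu + L)

x = np.array([10.0, 1.0])
for _ in range(500):
    x = x - alpha * grad(x)            # x_2 flips sign each step: zig-zagging
# Both components contract by |1 - alpha*lambda_i| = 99/101 ~ 0.98 per step,
# so even 500 iterations reduce the error only by a factor ~ exp(-10).
```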
Variants and Improvements
Momentum Methods:
- Heavy ball method: \(x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})\)
- Nesterov acceleration: Look-ahead gradient evaluation
Adaptive Step Sizes:
- AdaGrad: Accumulate squared gradients
- RMSprop: Exponential moving average of squared gradients
- Adam: Combine momentum and adaptive step sizes
Stochastic Variants:
- Stochastic Gradient Descent (SGD)
- Mini-batch methods
- Variance reduction techniques (SVRG, SAGA)
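A sketch of the heavy-ball iteration on the ill-conditioned quadratic from earlier; the tuning formulas are the classical choices for quadratics, and the helper name `heavy_ball` is illustrative.

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha, beta, iters=300):
    """x_{k+1} = x_k - alpha * grad f(x_k) + beta * (x_k - x_{k-1})."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
    return x

# For mu = 1, L = 100 the classical quadratic tuning is
# alpha = 4/(sqrt(L)+sqrt(mu))^2, beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2,
# giving contraction ~ 9/11 per step instead of 99/101 for plain gradient descent.
x = heavy_ball(lambda x: np.array([x[0], 100.0 * x[1]]), [10.0, 1.0],
               alpha=4.0 / (10.0 + 1.0)**2, beta=(9.0 / 11.0)**2)
```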
Preview: Advanced Methods
Coming in next lectures:
- Newton’s Method: Uses Hessian information
  \[x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\]
- Quasi-Newton Methods: Approximate Hessian (BFGS, L-BFGS)
- Conjugate Gradient: Exploits quadratic structure
- Trust Region Methods: Adaptive step size selection
Key Insight
All these methods can be viewed as variants of the basic iteration:
\[x_{k+1} = x_k + \alpha_k d_k\]
where \(d_k\) is a descent direction and \(\alpha_k > 0\) is a step size.
Examples and Applications
Example 1: Quadratic Function
- Consider \(f(x) = \frac{1}{2} x^T A x - b^T x\) where \(A \succ 0\).
- Gradient: \(\nabla f(x) = Ax - b\)
- Optimal point: \(x^* = A^{-1} b\)
- Gradient descent:
  \[x_{k+1} = x_k - \alpha (Ax_k - b) = (I - \alpha A) x_k + \alpha b\]
- Error analysis: \(e_k = x_k - x^*\) satisfies \(e_{k+1} = (I - \alpha A) e_k\)
- Convergence: Linear with rate \(\rho = \max\{|1 - \alpha \lambda_{\min}|, |1 - \alpha \lambda_{\max}|\}\)
- Optimal step size: \(\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}\) gives \(\rho^* = \frac{\kappa - 1}{\kappa + 1}\)
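A short sketch verifying the predicted rate on a small quadratic (the matrix \(A\) and vector \(b\) are arbitrary illustrative choices):

```python
import numpy as np

A = np.diag([1.0, 10.0])               # eigenvalues 1 and 10, so kappa = 10
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)

alpha = 2.0 / (1.0 + 10.0)             # optimal step 2/(lambda_min + lambda_max)

x = np.zeros(2)
errs = []
for _ in range(50):
    x = x - alpha * (A @ x - b)        # x_{k+1} = (I - alpha A) x_k + alpha b
    errs.append(np.linalg.norm(x - x_star))
ratios = np.array(errs[1:]) / np.array(errs[:-1])
# ratios approach rho* = (kappa - 1)/(kappa + 1) = 9/11 ~ 0.818
```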
Example 2: Logistic Regression
Problem: Given data \((x_i, y_i)_{i=1}^m\) where \(y_i \in \{0,1\}\), minimize the regularized negative log-likelihood:
\[f(w) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log\bigl(1 - \sigma(w^T x_i)\bigr) \right] + \frac{\lambda}{2} \|w\|^2, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}\]
Gradient:
\[\nabla f(w) = \frac{1}{m} \sum_{i=1}^m \bigl( \sigma(w^T x_i) - y_i \bigr) x_i + \lambda w\]
Properties:
- \(f\) is convex (strongly convex with regularization \(\lambda > 0\))
- \(\nabla f\) is Lipschitz continuous
- Gradient descent converges to the global optimum
- Widely used in machine learning
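A self-contained sketch on synthetic data; the data generation, step size, and choice of \(\lambda\) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(w, X, y, lam):
    """Gradient (1/m) X^T (sigmoid(Xw) - y) + lam * w of the regularized loss."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ w) - y) / m + lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                            # synthetic features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # synthetic labels

w = np.zeros(3)
for _ in range(2000):
    w -= 0.5 * logistic_grad(w, X, y, lam=1e-2)          # fixed step size
```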
Example 3: Least Squares with Regularization
Problem: \(\min_x \frac{1}{2}\|Ax - b\|^2 + \frac{\lambda}{2}\|x\|^2\)
Gradient: \(\nabla f(x) = A^T(Ax - b) + \lambda x = (A^TA + \lambda I)x - A^Tb\)
Analysis:
- Condition number: \(\kappa = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}\) where \(\sigma_1 \geq \cdots \geq \sigma_n \geq 0\) are the singular values of \(A\)
- Regularization improves conditioning
- For \(\lambda > 0\), always strongly convex
Practical considerations:
- Can solve directly: \(x^* = (A^TA + \lambda I)^{-1} A^Tb\)
- Gradient descent useful for very large problems
- Iterative methods can be more efficient than direct solve
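A sketch comparing the direct solve with gradient descent on a small random instance (the dimensions and \(\lambda\) are illustrative); the two answers should agree to high accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5))
b = rng.normal(size=100)
lam = 0.1

# Direct solve: x* = (A^T A + lam I)^{-1} A^T b
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

# Gradient descent with the safe step 1/L, where L = sigma_1(A)^2 + lam
L = np.linalg.norm(A, 2) ** 2 + lam
x = np.zeros(5)
for _ in range(5000):
    x -= (1.0 / L) * (A.T @ (A @ x - b) + lam * x)    # grad f(x)
# np.allclose(x, x_direct) should now hold
```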
Practical Guidelines
When to Use Gradient Descent
Good Cases:
- Large-scale problems (high dimensional)
- When Hessian is expensive or unavailable
- Smooth, well-conditioned problems
- As a warm-start for other methods
- Online/streaming optimization
Poor Cases:
- Ill-conditioned problems
- When high accuracy is needed quickly
- Small problems where direct methods work
- Highly non-convex functions with many local minima
Alternatives to Consider:
- Newton’s method (if Hessian available)
- Quasi-Newton methods (L-BFGS)
- Conjugate gradient (for quadratic problems)
Parameter Tuning Guidelines
Step Size Selection:
- Start with \(\alpha = 1\) and use backtracking
- For quadratic functions: estimate condition number and use \(\alpha = 2/(L + \mu)\)
- Monitor convergence: reduce if oscillating, increase if too slow
- Consider adaptive methods for complex problems
Stopping Criteria:
- Primary: \(\|\nabla f(x_k)\| \leq \epsilon_{abs} + \epsilon_{rel} \|\nabla f(x_0)\|\)
- Secondary: Maximum iterations (prevent infinite loops)
- Optional: Function value change or step size criteria
Implementation Tips:
- Scale variables to similar ranges
- Use good initial guess when possible
- Monitor and plot convergence history
Common Pitfalls and Debugging
Common Issues:
- Divergence: Step size too large, check \(\alpha < 2/L\)
- Slow convergence: Problem is ill-conditioned, consider preconditioning
- Oscillation: Step size too large or poor scaling
- Premature termination: Tolerance too loose or wrong stopping criterion
Debugging Strategies:
- Verify gradient implementation with finite differences
- Check that function decreases at each iteration
- Plot function value and gradient norm vs. iteration
- Test on simple quadratic functions first
- Use line search to automatically adjust step size
Performance Monitoring:
- Track: \(f(x_k)\), \(\|\nabla f(x_k)\|\), \(\|x_k - x_{k-1}\|\)
- Look for: monotonic decrease, exponential convergence patterns
Summary and Outlook
Key Takeaways
Gradient Descent Fundamentals:
- Simple, intuitive algorithm: move opposite to the gradient direction
- Guaranteed convergence under mild conditions
- Sublinear convergence rate: \(O(1/k)\) in the general case
Critical Factors:
- Step size selection determines success/failure
- Problem conditioning (\(\kappa = L/\mu\)) affects convergence speed
- Proper scaling and initialization matter significantly
Practical Impact:
- Foundation of modern machine learning optimization
- Building block for more sophisticated methods
- Still widely used for large-scale problems
Connection to Broader Optimization
Gradient Descent as Template:
| Method | Direction \(d_k\) | Properties |
|---|---|---|
| Gradient Descent | \(-\nabla f(x_k)\) | Simple, robust |
| Newton’s Method | \(-[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)\) | Fast, quadratic convergence |
| Quasi-Newton | \(-B_k^{-1}\nabla f(x_k)\) | Balance speed/cost |
| Conjugate Gradient | Conjugate directions | Optimal for quadratic |
Unifying Principle: All methods choose direction \(d_k\) and step size \(\alpha_k\) to ensure convergence while optimizing computational efficiency.
Looking Forward
Next Topics:
- Newton’s Method: Second-order information for faster convergence
- Quasi-Newton Methods: Approximating Hessian efficiently
- Constrained Optimization: Handling equality/inequality constraints
- Stochastic Methods: Large-scale and noisy optimization
Advanced Topics:
- Non-convex optimization landscapes
- Parallel and distributed optimization
- Online and streaming optimization
- Robust optimization under uncertainty
Remember
Understanding gradient descent deeply provides the foundation for mastering all advanced optimization methods.
Next lecture: Newton’s Method and Second-Order Optimization