Join the Live Seminar

Research Seminar: How does gradient descent work?

June 20 | 12 PM ET | 9 AM PT
Jeremy Cohen
Research fellow at the Flatiron Institute

Talk Abstract

Optimization is the engine of deep learning, yet the theory of optimization has had little impact on the practice of deep learning. Why?

In this talk, we will first show that traditional theories of optimization cannot explain the convergence of the simplest optimization algorithm — deterministic gradient descent — in deep learning. Whereas traditional theories assert that gradient descent converges because the curvature of the loss landscape is “a priori” small, in reality gradient descent converges because it *dynamically avoids* high-curvature regions of the loss landscape. Understanding this behavior requires Taylor expanding to third order, which is one order higher than normally used in optimization theory. While the “fine-grained” dynamics of gradient descent involve chaotic oscillations that are difficult to analyze, we will demonstrate that the “time-averaged” dynamics are, fortunately, much more tractable.

We will show that our time-averaged analysis yields highly accurate quantitative predictions in a variety of deep learning settings. Since gradient descent is the simplest optimization algorithm, we hope this analysis can help point the way towards a mathematical theory of optimization in deep learning.
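As a rough illustration of the kind of analysis sketched above (this is not material from the talk itself): for a single gradient descent step θ_{t+1} = θ_t − η∇L(θ_t), Taylor expanding the loss to third order gives

$$
L(\theta_{t+1}) = L(\theta_t) - \eta\,\lVert\nabla L(\theta_t)\rVert^2 + \tfrac{\eta^2}{2}\,\nabla L(\theta_t)^{\top}\nabla^2 L(\theta_t)\,\nabla L(\theta_t) - \tfrac{\eta^3}{6}\,\nabla^3 L(\theta_t)\big[\nabla L(\theta_t),\nabla L(\theta_t),\nabla L(\theta_t)\big] + O(\eta^4),
$$

where the second-order term is controlled by the curvature (the Hessian) and the third-order term is the extra order the abstract refers to. The Python sketch below is a minimal, assumption-laden example (toy data, a tiny tanh network, finite-difference derivatives, an arbitrary step size) and is not code from the speaker; it only shows how one might monitor the largest Hessian eigenvalue, often called the sharpness, against the classical stability threshold 2/η while running deterministic (full-batch) gradient descent.

```python
# Minimal illustrative sketch (assumptions: toy 1-D regression data, a tiny
# one-hidden-layer tanh network, finite-difference derivatives, eta = 0.5).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=32)   # toy inputs
Y = np.sin(X)                         # toy regression targets

K = 3                                 # hidden width of the tiny network

def predict(theta, x):
    w, a = theta[:K], theta[K:]       # input weights and output weights
    return np.tanh(np.outer(x, w)) @ a

def loss(theta):
    return 0.5 * np.mean((predict(theta, X) - Y) ** 2)

def grad(theta, eps=1e-5):
    # central finite differences; adequate for this low-dimensional illustration
    g = np.empty_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def sharpness(theta, eps=1e-4):
    # largest eigenvalue of a finite-difference Hessian of the training loss
    n = theta.size
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[:, i] = (grad(theta + e) - grad(theta - e)) / (2 * eps)
    H = 0.5 * (H + H.T)               # symmetrize away numerical noise
    return np.linalg.eigvalsh(H)[-1]

eta = 0.5                             # step size; 2/eta is the stability threshold
theta = 0.5 * rng.standard_normal(2 * K)
for step in range(201):
    theta = theta - eta * grad(theta) # deterministic (full-batch) gradient descent
    if step % 40 == 0:
        print(f"step {step:3d}  loss {loss(theta):.4f}  "
              f"sharpness {sharpness(theta):.3f}  vs 2/eta = {2/eta:.3f}")
```

In the full-scale settings studied in this line of work, the sharpness is typically observed to rise toward and then hover near 2/η rather than staying small a priori; a toy example this small may not reproduce that behavior, but the same diagnostic applies.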

Get Your Invite


Key takeaways

What you will learn:

Why traditional theories of optimization cannot explain the convergence of gradient descent in deep learning
How gradient descent converges by dynamically avoiding high-curvature regions of the loss landscape, and why seeing this requires a third-order Taylor expansion
How a time-averaged analysis of gradient descent yields accurate quantitative predictions across a variety of deep learning settings

Meet the speaker

Jeremy Cohen
Research fellow at the Flatiron Institute

Jeremy Cohen is a research fellow at the Flatiron Institute. He has recently been working on understanding optimization in deep learning. He obtained his PhD in 2024 from Carnegie Mellon University, advised by Zico Kolter and Ameet Talwalkar.

Want to stay up to date with all the AI trends?

Clarifai was built to simplify how developers and teams create, share, and run AI at scale

Build your next AI app, test and tune popular LLMs, and much more.
