Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed • Paper • 2406.04443 • Published Jun 6, 2024
Benchmarking Optimizers for Large Language Model Pretraining • Paper • 2509.01440 • Published 5 days ago
Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning • Paper • 2404.03323 • Published Apr 4, 2024