
Workshop Talk | Feb. 23, 2026

The Many Faces of Heterogeneity: Federated, Continual, and Modular Learning

Data heterogeneity is a pervasive challenge in modern machine learning, yet it is typically studied in isolation within separate subfields. For instance, in federated learning, statistical heterogeneity across clients causes local models to drift and overspecialize; in continual learning, temporal heterogeneity in data distributions leads to catastrophic forgetting; and in model merging and modular approaches, heterogeneity across independently trained components makes their composition fragile and unpredictable. In this talk, I argue that these are manifestations of the same fundamental tension: learning from diverse, non-stationary, and decentralized sources while preserving and composing acquired knowledge. I will present a unified perspective that connects these three settings and show how insights can transfer across them, pointing toward a common research agenda: developing methods that embrace heterogeneity as a design principle rather than treating it as an obstacle to overcome.

Workshop Talk | Feb. 23, 2026

Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration

Modern machine learning often requires training with large batch sizes, distributed data, and massively parallel compute hardware. While communication becomes a major bottleneck in such settings, methods like Local Stochastic Gradient Descent (Local SGD) show great promise in reducing this overhead. Local SGD consists of three distinct parts: a local optimization process, an aggregation mechanism, and an outer optimizer that uses aggregated updates to produce a new model.
While extensive literature exists on understanding hyperparameters in the local optimization process, the choice of the outer optimizer and its hyperparameters is less clear. In this talk, we will explore the role of the outer optimizer in Local SGD and present new convergence guarantees for the algorithm. In particular, we will demonstrate that tuning the outer learning rate allows us to (a) trade off between optimization error and stochastic gradient noise variance, and (b) compensate for the ill-tuning of the inner learning rate. Our theory suggests that the outer learning rate should sometimes be set to values greater than 1.
Furthermore, we will extend these results to settings using momentum in the outer optimizer, showing a similar role for the momentum-adjusted outer learning rate. We will also discuss acceleration in the outer optimizer, showing that it improves the convergence rate as a function of the number of communication rounds, improving upon prior algorithms that apply acceleration locally. Finally, we will discuss some experimental validation for our theoretical results and some avenues for future work.
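
To make the three parts concrete, here is a minimal PyTorch sketch of one Local SGD communication round with an explicit outer optimizer; the model/loader interface and the plain gradient-descent outer step are illustrative assumptions, not the talk's exact algorithm:

```python
import copy
import torch

def local_sgd_round(global_model, client_loaders, inner_lr=0.01,
                    outer_lr=1.0, local_steps=10):
    """One Local SGD round: (1) local optimization, (2) aggregation,
    (3) outer optimizer step on the averaged pseudo-gradient."""
    global_params = [p.detach().clone() for p in global_model.parameters()]
    avg_delta = [torch.zeros_like(p) for p in global_params]

    for loader in client_loaders:  # hypothetical iterables of (x, y) batches
        # (1) Each client runs `local_steps` of SGD starting from the global model.
        local_model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local_model.parameters(), lr=inner_lr)
        batches = iter(loader)
        for _ in range(local_steps):
            x, y = next(batches)
            opt.zero_grad()
            torch.nn.functional.cross_entropy(local_model(x), y).backward()
            opt.step()
        # (2) Aggregation: accumulate the client's update as a "pseudo-gradient".
        for d, p_loc, p_glob in zip(avg_delta, local_model.parameters(),
                                    global_params):
            d += (p_glob - p_loc.detach()) / len(client_loaders)

    # (3) Outer optimizer: plain gradient descent on the averaged
    # pseudo-gradient; the talk's analysis suggests outer_lr > 1 can pay off.
    with torch.no_grad():
        for p, d in zip(global_model.parameters(), avg_delta):
            p -= outer_lr * d
```

Setting outer_lr = 1 recovers vanilla parameter averaging; swapping step (3) for SGD with momentum or a Nesterov-style update on the pseudo-gradient gives the momentum and acceleration variants discussed in the talk.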

Workshop Talk | Feb. 23, 2026

Federated Reinforcement Learning: Statistical and Communication Trade-offs

Reinforcement learning (RL), concerned with decision making in uncertain environments, lies at the heart of modern artificial intelligence. Due to high dimensionality, training RL agents typically requires a significant amount of computation and data to achieve desirable performance. However, in real-world applications data collection can be extremely time-consuming and access to data limited, especially when performed by a single agent. A natural remedy is to leverage multiple agents to collect data simultaneously, under the premise that they can learn a global policy collaboratively, in a federated manner, without sharing local data. This talk addresses the fundamental statistical and communication trade-offs in the design of federated RL algorithms, covering both the blessings and the curses of data and task heterogeneity across agents.
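
As one concrete instantiation of this setup (a minimal sketch, not the talk's algorithms), agents can run local tabular Q-learning on their own environments and periodically average their Q-tables, exchanging parameters rather than raw trajectories; the environment interface and hyperparameters below are assumptions:

```python
import numpy as np

def federated_q_learning(envs, n_states, n_actions, rounds=100,
                         local_steps=50, lr=0.1, gamma=0.99, eps=0.1, seed=0):
    """Agents run local Q-learning and share only Q-tables (never raw
    transitions), which the server averages once per round.
    Hypothetical env interface: reset() -> s, step(a) -> (s_next, r, done)."""
    rng = np.random.default_rng(seed)
    q_global = np.zeros((n_states, n_actions))
    for _ in range(rounds):
        local_qs = []
        for env in envs:
            q = q_global.copy()
            s = env.reset()
            for _ in range(local_steps):
                # Epsilon-greedy action selection.
                a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(q[s]))
                s_next, r, done = env.step(a)
                # Standard Q-learning temporal-difference update.
                q[s, a] += lr * (r + gamma * np.max(q[s_next]) - q[s, a])
                s = env.reset() if done else s_next
            local_qs.append(q)
        # Communication step: average the local tables.
        q_global = np.mean(local_qs, axis=0)
    return q_global
```

The averaging frequency is exactly where the statistical/communication trade-off shows up: longer local phases save communication but let heterogeneous agents drift apart.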

Workshop Talk | Feb. 23, 2026

Scale Learning and Reasoning Across Heterogeneous Gradients and Semantics

In federated or decentralized environments, the tasks clients solve can exhibit different gradients (along with different gradient statistics), or even entirely different problem semantics (e.g., solving a biology problem versus a CS problem). These forms of heterogeneity pose significant challenges to training and inference. For instance, to realize the benefits of adaptive optimization, it is critical to handle the heterogeneous preconditioners accumulated on the client side. I first introduce Bi^2Clip, an optimizer that approximates adaptive methods without maintaining preconditioners. It leverages coordinate-wise bi-directional clipping, which also helps mitigate issues such as heavy-tailed gradient noise. I then discuss an asynchronous framework built on it that tolerates heterogeneous hardware, with improved dependence on staleness. Lastly, we shift from the gradient/parameter space to the semantic space and present a federation-over-text (FoT) framework designed for heterogeneous tasks and domains. In FoT, instead of transmitting gradients or models, clients iteratively share metacognitive summaries of their local reasoning and planning processes to build a library of reusable insights. I discuss its early applications to math problem solving and ML research insight discovery. Overall, these results show that treating heterogeneity carefully can yield meaningful gains in both final performance and learning/inference efficiency.
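
A minimal sketch of what coordinate-wise bi-directional clipping could look like (the thresholds, and the use of clipping as the entire update rule, are illustrative assumptions, not the talk's actual Bi^2Clip specification):

```python
import torch

def bi_clip_step(params, lr=1.0, lower=1e-4, upper=1e-2):
    """Sketch of a coordinate-wise bi-directional clipping update: each
    gradient coordinate's magnitude is clamped into [lower, upper] while
    keeping its sign, so no per-coordinate preconditioner state is ever
    stored or synchronized. Thresholds here are illustrative guesses."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad
            # Upper clip tames large coordinates (like a large adaptive
            # denominator); lower clip boosts tiny ones (like a small one).
            clipped = torch.sign(g) * g.abs().clamp(min=lower, max=upper)
            p -= lr * clipped
```

Intuitively, the two clips mimic the per-coordinate rescaling of an adaptive method while keeping the optimizer stateless, which is what makes it attractive when client-side preconditioners would otherwise be heterogeneous.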

Workshop Talk | Feb. 23, 2026

From the Ball-proximal (Broximal) Point Method to Efficient Training of LLMs

Non-smooth and non-convex global optimization poses significant challenges across various applications, where standard gradient-based methods often struggle. We propose the Ball-Proximal Point Method, Broximal Point Method, or Ball Point Method (BPM) for short – a novel algorithmic framework inspired by the classical Proximal Point Method (PPM) [8], which, as we show, sheds new light on several foundational optimization paradigms and phenomena, including non-convex and non-smooth optimization, acceleration, smoothing, adaptive stepsize selection, and trust-region methods. At the core of BPM lies the ball-proximal ("broximal") operator, which arises from the classical proximal operator by replacing the quadratic distance penalty with a ball constraint. Surprisingly, and in sharp contrast with the sublinear rate of PPM in the non-smooth convex regime, we prove that BPM converges linearly and in a finite number of steps in the same regime. Furthermore, by introducing the concept of ball-convexity, we prove that BPM retains the same global convergence guarantees under weaker assumptions, making it a powerful tool for a broader class of potentially non-convex optimization problems. Just as PPM serves as a conceptual method inspiring the development of practically efficient algorithms and algorithmic elements, e.g., gradient descent, adaptive stepsizes, acceleration [1], and the "W" in AdamW [9], we believe that BPM should be understood in the same manner: as a blueprint and inspiration for further development. Generalizations to non-Euclidean ball constraints can be found in the follow-up work [3].
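
For concreteness, the two operators side by side, in the Euclidean setting of [2] (the radius r and stepsize \gamma below are generic symbols, not the talk's specific choices):

```latex
% Classical proximal (PPM) step:
x_{k+1} = \mathrm{prox}_{\gamma f}(x_k)
        := \arg\min_{x} \Big\{ f(x) + \tfrac{1}{2\gamma}\,\|x - x_k\|_2^2 \Big\}

% Ball-proximal ("broximal") BPM step: the quadratic penalty is replaced
% by a hard ball constraint of radius r around the current iterate:
x_{k+1} \in \arg\min_{x \,:\, \|x - x_k\|_2 \le r} f(x)
```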

The Broximal Point Method (BPM) [2] offers an idealized optimization framework based on iteratively minimizing the objective function over norm balls centered at the current iterate. It enjoys striking global convergence guarantees, converging linearly and in a finite number of steps for proper, closed and convex functions. However, its theoretical analysis has so far been confined to the Euclidean geometry. At the same time, emerging trends in deep learning optimization, exemplified by algorithms such as Muon [4] and Scion [6], demonstrate the practical advantages of minimizing over balls defined via non-Euclidean norms which better align with the underlying geometry of the associated loss landscapes. We ask whether the convergence theory of BPM can be extended to this more general, non-Euclidean setting. We give a positive answer, showing that most of the elegant guarantees of the original method carry over to arbitrary norm geometries. Along the way, we clarify which properties are preserved and which necessarily break down when leaving the Euclidean realm. Our analysis positions Non-Euclidean BPM as a conceptual blueprint for understanding a broad class of geometry-aware optimization algorithms, shedding light on the principles behind their practical effectiveness.

Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as Muon [4] and Scion [6]. After over a decade of Adam's [5] dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and, most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called Gluon, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of Muon and Scion, and leads to convergence guarantees with strong practical predictive power. Unlike prior results, our theoretical stepsizes closely match the fine-tuned values reported in [6]. Our experiments with NanoGPT and a CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.
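
As a rough illustration of the shared LMO template (not Gluon's actual algorithm), here is a layer-wise step under a spectral-norm ball for matrix-shaped layers, with per-layer radii as hypothetical tuning knobs:

```python
import torch

def spectral_lmo(g):
    """LMO over the unit spectral-norm ball:
    argmin_{||X||_2 <= 1} <g, X> = -U V^T for g = U S V^T.
    Computed exactly via SVD here; Muon instead approximates this
    direction with Newton-Schulz iterations."""
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    return -(u @ vh)

def layerwise_lmo_step(layers, radii):
    """One layer-wise LMO update: each layer moves by its own radius times
    the LMO of its gradient, i.e., toward the minimizer of the linearized
    loss over that layer's norm ball. `layers` maps names to 2-D weight
    tensors; `radii` maps names to (hypothetical) per-layer stepsizes."""
    with torch.no_grad():
        for name, w in layers.items():
            if w.grad is None:
                continue
            w += radii[name] * spectral_lmo(w.grad)
```

Choosing a different norm per layer type changes the LMO, and this per-layer structure is exactly what the Gluon analysis is built around.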

Recent optimizers like Muon [4], Scion [6], and Gluon [7] have pushed the frontier of large-scale deep learning by exploiting layer-wise linear minimization oracles (LMOs) over non-Euclidean norm balls, capturing neural network structure in ways traditional algorithms cannot. Yet no principled distributed framework exists for these methods, and communication bottlenecks remain unaddressed. The very few distributed variants are heuristic, with no convergence guarantees in sight. We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. EF21-Muon supports stochastic gradients, momentum, and bidirectional compression with error feedback, marking the first extension of error feedback beyond the Euclidean setting. It recovers Muon/Scion/Gluon when compression is off and specific norms are chosen, providing the first efficient distributed implementation of this powerful family. Our theory covers non-Euclidean smooth and more general smoothness settings, enabling communication savings with no accuracy degradation.
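
For intuition about the error-feedback component, here is a minimal sketch of the Euclidean EF21 mechanism that the method extends (top-k compression and simple averaging are illustrative choices, not EF21-Muon itself):

```python
import torch

def top_k(v, k):
    """Top-k sparsifier: keep only the k largest-magnitude coordinates."""
    out = torch.zeros_like(v)
    idx = v.abs().topk(k).indices
    out[idx] = v[idx]
    return out

def ef21_round(grads, states, k):
    """One EF21 round. Each worker i holds a running estimate states[i]
    of its gradient and transmits only the compressed correction
    top_k(g_i - states[i]); both worker and server add it to states[i],
    so the estimates track the true gradients despite lossy compression.
    `grads` and `states` are lists of flat tensors, one per worker."""
    for i, g in enumerate(grads):
        msg = top_k(g - states[i], k)  # the only thing sent over the network
        states[i] += msg
    return torch.mean(torch.stack(states), dim=0)
```

The server would then feed the aggregated estimate into the LMO-based update; EF21-Muon's contribution is making this loop work, with guarantees, when the geometry is non-Euclidean and compression runs in both directions.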



References

[1] Kwangjun Ahn and Suvrit Sra. "Understanding Nesterov's Acceleration via Proximal Point Method". Symposium on Simplicity in Algorithms (SOSA), 2022, pp. 117–130.

[2] Kaja Gruntkowska, Hanmin Li, Aadi Rane, and Peter Richtárik. "The ball-proximal (="broximal") point method: a new algorithm, convergence theory, and applications". arXiv preprint arXiv:2502.02002, 2025.

[3] Kaja Gruntkowska and Peter Richtárik. "Non-Euclidean broximal point method: a blueprint for geometry-aware optimization". arXiv preprint arXiv:2510.00823, 2025.

[4] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. "Muon: An optimizer for hidden layers in neural networks". 2024.

[5] Diederik P. Kingma and Jimmy Ba. "Adam: A method for stochastic optimization". arXiv preprint arXiv:1412.6980, 2014.

[6] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. "Training deep learning models with norm-constrained LMOs". arXiv preprint arXiv:2502.07529, 2025.

[7] Artem Riabinin, Kaja Gruntkowska, Egor Shulgin, and Peter Richtárik. "Gluon: Making Muon and Scion great again! (Bridging theory and practice of LMO-based optimizers for LLMs)". arXiv preprint arXiv:2505.13416, 2025.

[8] R. T. Rockafellar. "Monotone operators and the proximal point algorithm". SIAM Journal on Control and Optimization 14.5 (1976), pp. 877–898.

[9] Z. Zhuang, M. Liu, A. Cutkosky, and F. Orabona. "Understanding AdamW through proximal methods and scale-freeness". Transactions on Machine Learning Research, 2022.

[10] Kaja Gruntkowska, Yassine Maziane, Zheng Qu, and Peter Richtárik. "Drop-Muon: Update less, converge faster". arXiv preprint arXiv:2510.02239, 2025.

[11] Kaja Gruntkowska, Alexander Gaponov, Zhirayr Tovmasyan, and Peter Richtárik. "Error feedback for Muon and friends". 14th International Conference on Learning Representations (ICLR 2026).

Workshop | February 23, 2026, 9:00 am – February 27, 2026, 5:00 pm
Learning from Heterogeneous Sources

Federated and collaborative learning systems mark a shift from classical data analysis scenarios, where we view samples as coming from a single large underlying population. Instead, techniques for federated and collaborative learning necessitate new...


Responsibly Improving AI with Privacy-Sensitive Data: Principles, Theory, and Practice

Brendan McMahan (Google)

Richard M. Karp Distinguished Lecture

Tuesday, February 24
3:30 – 4:30 p.m. PT
Calvin Lab auditorium

Video | Feb. 22, 2026
Responsibly Improving AI with Privacy-Sensitive Data: Principles, Theory, and Practice | Richard M. Karp Distinguished Lecture

Video | Feb. 22, 2026
Robust and Private Federated Learning: Limits and Algorithms
