Large language models are often limited by data, especially when valuable datasets are distributed across institutions or cannot be shared. We introduce FlexOlmo, a new class of Mixture-of-Experts (MoE) models designed for flexible, modular data use. In FlexOlmo, expert modules are trained independently on separate datasets and later merged seamlessly into a single model. This enables distributed training without data sharing, supports the use of closed datasets, and allows data to be opted in or out at inference time. We scale FlexOlmo to 37B parameters (20B active) and evaluate on 31 diverse downstream tasks. FlexOlmo significantly outperforms models trained on public data only and approaches the performance of an upper-bound model trained on all datasets. By enabling modular integration of closed data while respecting data ownership and control, FlexOlmo offers a practical path toward collaborative, continuous model development.
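The opt-in/opt-out idea above can be illustrated with a toy sketch: independently trained expert modules are merged into one MoE layer, and at inference time the router only mixes experts whose data owners have opted in. This is an illustrative assumption, not FlexOlmo's actual architecture or code; the expert names, the single-linear-layer experts, and the softmax router are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Hypothetical experts, each trained independently on a separate dataset,
# then merged into a single MoE layer (illustrative only).
experts = {name: rng.standard_normal((d, d)) / np.sqrt(d)
           for name in ["public", "medical", "legal"]}
router = {name: rng.standard_normal(d) for name in experts}

def moe_forward(x, opted_in):
    """Mix expert outputs, routing only among opted-in experts."""
    names = [n for n in opted_in if n in experts]
    logits = np.array([router[n] @ x for n in names])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return sum(w * (experts[n] @ x) for w, n in zip(weights, names))

x = rng.standard_normal(d)
y_all = moe_forward(x, ["public", "medical", "legal"])
y_optout = moe_forward(x, ["public"])  # closed-data owners opted out
```

Opting a dataset out simply removes its expert from the routing set; no retraining of the remaining experts is needed in this sketch.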
Large language models (LLMs) have not yet effectively leveraged the vast amounts of data available on edge devices. Federated learning (FL) offers a promising way to collaboratively fine-tune LLMs without transferring private edge data to the cloud. To work within the computation and communication constraints of edge devices, recent research on federated fine-tuning of LLMs uses low-rank adaptation (LoRA) and similar parameter-efficient methods. However, LoRA-based methods suffer from accuracy loss in FL settings, primarily due to data and computational heterogeneity across clients. In this talk, I will first discuss an adaptive multi-head LoRA method that balances parameter efficiency and model expressivity by reparameterizing weight updates as the sum of multiple LoRA heads. In the second part of my talk, I will discuss other ways to leverage edge data, such as one-shot merging of locally trained models or training query routers personalized to each client's edge data.
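The multi-head reparameterization mentioned above can be sketched as follows: instead of a single low-rank update ΔW = BA of rank r·H, the update is written as a sum of H heads, ΔW = Σ_h B_h A_h, each of rank r. The dimensions, head count, and initialization below are hypothetical; this is a minimal sketch of the idea, not the talk's actual method or code.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in = 16, 16
num_heads, r = 4, 2  # hypothetical: 4 heads, rank 2 per head

W0 = rng.standard_normal((d_out, d_in))  # frozen pretrained weight

# Multi-head LoRA sketch: delta_W = sum_h B_h @ A_h.
# A is small random init, B is zero init (as in standard LoRA),
# so the update is zero at the start of fine-tuning.
A = [rng.standard_normal((r, d_in)) * 0.01 for _ in range(num_heads)]
B = [np.zeros((d_out, r)) for _ in range(num_heads)]

def forward(x):
    delta_W = sum(Bh @ Ah for Ah, Bh in zip(A, B))
    return (W0 + delta_W) @ x

x = rng.standard_normal(d_in)
```

Note that the trainable parameter count, H·r·(d_in + d_out), matches a single LoRA of rank H·r; the heads add expressivity (a sum of independently adapted low-rank terms) without changing the parameter budget.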
We provide a brief introduction to local update methods developed for federated optimization and discuss their worst-case complexity. Surprisingly, these methods often perform much better in practice than predicted by theoretical analyses using classical assumptions. Recent years have revealed that their performance can be better described using refined notions that capture the similarity among client objectives. In this talk, we introduce a generic framework based on a distributed proximal point algorithm, which consolidates many of our insights and allows for the adaptation of arbitrary centralized optimization algorithms to the convex federated setting, including accelerated variants. Our theoretical analysis shows that the derived methods enjoy faster convergence when the degree of similarity among clients is high.
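The distributed proximal point framework can be illustrated on toy quadratic client objectives, where each client's proximal subproblem has a closed form: given the server iterate x̄, client i solves argmin_x f_i(x) + (1/2η)‖x − x̄‖², and the server averages the results. The quadratic objectives f_i(x) = ½‖x − c_i‖² and the step size below are illustrative assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_clients, eta = 4, 5, 1.0

# Toy quadratic client objectives: f_i(x) = 0.5 * ||x - c_i||^2
c = rng.standard_normal((n_clients, d))

def prox_step(x_bar, c_i, eta):
    # argmin_x f_i(x) + (1/(2*eta)) * ||x - x_bar||^2
    # Setting the gradient to zero: (x - c_i) + (x - x_bar)/eta = 0
    return (eta * c_i + x_bar) / (eta + 1.0)

x = np.zeros(d)
for _ in range(200):
    # each client takes a proximal point step; the server averages
    x = np.mean([prox_step(x, c_i, eta) for c_i in c], axis=0)
```

On this example the averaged iterate contracts toward the minimizer of the average objective, mean(c_i), at rate 1/(η+1) per round; higher similarity among the c_i (i.e., among client objectives) means each round's local solutions already nearly agree, matching the abstract's claim that convergence is faster when client similarity is high.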
Based on joint work with Xiaowen Jiang and Anton Rodomanov.