We present differentially private algorithms for high-dimensional mean estimation. Previous private estimators on distributions over ℝ^d suffer from a curse of dimensionality, as they require Ω(d^{1/2}) samples to achieve non-trivial error, even in cases where O(1) samples suffice without privacy. This rate is unavoidable when the distribution is isotropic, namely, when the covariance is a multiple of the identity matrix, or when accuracy is measured with respect to the affine-invariant Mahalanobis distance. Yet, real-world data is often highly anisotropic, with signals concentrated on a small number of principal components. We develop estimators that are appropriate for such signals: our estimators are (ε,δ)-differentially private and have sample complexity that is dimension-independent for anisotropic subgaussian distributions. Given n samples from a distribution with known covariance-proxy Σ and unknown mean μ, we present an estimator μ̂ that achieves error ‖μ̂ − μ‖_2 ≤ α, as long as n ≳ tr(Σ)/α² + tr(Σ^{1/2})/(αε). In particular, when σ² = (σ_1², …, σ_d²) are the singular values of Σ, we have tr(Σ) = ‖σ‖_2² and tr(Σ^{1/2}) = ‖σ‖_1, and hence our bound avoids dimension-dependence when the signal is concentrated in a few principal components. We show that this is the optimal sample complexity for this task up to logarithmic factors. Moreover, for the case of unknown covariance, we present an algorithm whose sample complexity has improved dependence on the dimension, from d^{1/2} to d^{1/4}.
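To see why the bound avoids dimension-dependence for anisotropic spectra, the following sketch compares the two terms tr(Σ)/α² + tr(Σ^{1/2})/(αε) for an isotropic covariance versus a spiked one. The spectrum, α, and ε here are made-up illustrative values, not taken from the paper:

```python
import numpy as np

d = 1000
# Isotropic covariance: Sigma = I_d, all eigenvalues equal to 1.
iso = np.ones(d)
# Anisotropic (hypothetical spectrum): signal concentrated in 10
# principal components, with the remaining eigenvalues tiny.
aniso = np.concatenate([np.ones(10), np.full(d - 10, 1e-6)])

def sample_terms(eigs, alpha=0.1, eps=1.0):
    """Evaluate n >~ tr(Sigma)/alpha^2 + tr(Sigma^{1/2})/(alpha*eps)."""
    return eigs.sum() / alpha**2 + np.sqrt(eigs).sum() / (alpha * eps)

print(sample_terms(iso))    # grows linearly with d
print(sample_terms(aniso))  # dominated by the 10 spike directions
```

For the isotropic spectrum both traces scale with d, while for the spiked spectrum they are governed by the few large eigenvalues, which is exactly the dimension-independence claimed above.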
Federated learning (FL) enables collaborative AI development without centralizing sensitive data. This talk highlights its real-world use in predicting COVID-19 outcomes while preserving patient privacy and data governance. I then introduce the latest innovations in NVIDIA FLARE, the open-source SDK for production-ready FL. New APIs simplify workflow design, while upgrades such as message quantization, native tensor transfer, and model streaming improve communication efficiency. Combined with privacy-preserving techniques like differential privacy, homomorphic encryption, and confidential computing, FLARE integrates seamlessly with popular training libraries and supports customizable workflows—empowering scalable applications in healthcare, biopharma, financial services, and large-scale language models.
I'll discuss models for distributed and private analysis of network data in which nodes retain control of their data. I'll focus particularly on the *local* node-private model, which is a good fit for distributed social network data. We provide the first formulation and systematic investigation of the model, design a new algorithmic framework tailored to it, and develop new lower bound techniques that show fundamental limitations on its power. Joint work with Sofya Raskhodnikova, Connor Wagaman, and Anatoly Zavyalov.
As AI tools become deeply embedded in daily life, analyzing user interactions unlocks critical insights. These conversational logs hold the key to improving AI capabilities and safety, understanding the future of work, tracking shifting societal behaviors, measuring the economic implications of AI adoption, etc. However, unlocking this immense value comes with a steep cost: directly analyzing chat logs risks exposing highly sensitive user information. Anthropic's CLIO represented an important first step in tackling this problem, establishing thoughtful (but heuristic) privacy guarantees to protect user data. Building on this foundation, a critical challenge remains: how can we accurately map the landscape of AI usage with formal mathematical guarantees against eavesdropping?
We first present "Urania", a framework for generating insights from AI usage logs while satisfying end-to-end differential privacy (DP). We will explore how Urania maps individual conversation records to embedding vectors and partitions them using differentially private clustering. From there, it uses partition selection to unpack aggregated keyword histograms into coherent cluster descriptions. We will also discuss the metrics and criteria necessary for evaluating the quality of these private insights.
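The partition-selection step above can be illustrated with a simple thresholding mechanism: release a keyword only if its noised count clears a threshold. This is a generic DP partition-selection sketch, not Urania's exact mechanism; the keyword counts, ε, and threshold below are invented for illustration, and each user is assumed to contribute at most one count (sensitivity 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-cluster keyword counts (one contribution per user).
counts = {"refactor code": 1200, "write email": 800, "rare topic": 3}

eps, threshold = 1.0, 20.0  # illustrative privacy/selection parameters

def dp_partition_select(counts, eps, threshold):
    """Release a keyword only if its Laplace-noised count clears the
    threshold -- a simple thresholding form of DP partition selection."""
    released = {}
    for kw, c in counts.items():
        noisy = c + rng.laplace(scale=1.0 / eps)  # sensitivity-1 count
        if noisy >= threshold:
            released[kw] = noisy
    return released

print(dp_partition_select(counts, eps, threshold))
```

The threshold is what suppresses rare (and hence potentially identifying) keywords: with overwhelming probability the count of 3 never survives, while the common clusters do.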
Finally, we will preview "Calliope", an upcoming method that reimagines this pipeline. Calliope bypasses the often "alien" geometric representations of embeddings by using LLMs directly for hierarchical clustering. By operating natively in the text domain, Calliope overcomes the "utility cliffs" traditionally associated with private geometric clustering, pointing toward a next generation of high-utility, privacy-preserving AI analytics. We will close with some forward-looking key challenges.
Based on https://arxiv.org/abs/2506.04681, joint work with several amazing collaborators!
Differentially private stochastic gradient descent (DP-SGD) is the most widely used method for training machine learning models with provable privacy guarantees. A key challenge in DP-SGD is setting the per-sample gradient clipping threshold, which significantly affects the trade-off between privacy and utility. While recent adaptive methods improve performance by adjusting this threshold during training, they operate in the standard coordinate system and fail to account for correlations across the coordinates of the gradient. We propose GeoClip, a geometry-aware framework that clips and perturbs gradients in a transformed basis aligned with the geometry of the gradient distribution. GeoClip adaptively estimates this transformation using only previously released noisy gradients, incurring no additional privacy cost. We provide convergence guarantees for GeoClip and derive a closed-form solution for the optimal transformation that minimizes the amount of noise added while keeping the probability of gradient clipping under control. Experiments on both tabular and image datasets demonstrate that GeoClip consistently outperforms existing adaptive clipping methods under the same privacy budget.
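The core idea of clipping in a transformed basis can be sketched as follows. This is a simplified illustration, not GeoClip's actual estimator or noise calibration: the function name, the ridge term, and the use of a raw second-moment matrix of past noisy gradients are all assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def geo_clip_step(grad, past_noisy_grads, clip_norm=1.0, noise_mult=1.0):
    """Clip and perturb one per-sample gradient in a basis estimated
    from previously released noisy gradients (simplified sketch)."""
    # Estimate second moments from already-released noisy gradients;
    # reusing released outputs costs no additional privacy budget.
    G = np.stack(past_noisy_grads)
    cov = G.T @ G / len(past_noisy_grads) + 1e-6 * np.eye(grad.size)
    evals, evecs = np.linalg.eigh(cov)
    # Whitening transform aligned with the gradient geometry, and its inverse.
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    W_inv = evecs @ np.diag(evals ** 0.5) @ evecs.T
    g_w = W @ grad
    g_w *= min(1.0, clip_norm / np.linalg.norm(g_w))  # clip in whitened basis
    noisy = g_w + noise_mult * clip_norm * rng.normal(size=grad.size)
    return W_inv @ noisy  # map the noised gradient back to the original basis
```

Clipping after whitening means tightly correlated coordinates are shrunk together rather than independently, which is the geometric effect the paragraph above describes.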
A membership-inference attack gets the output of a learning algorithm, and a target individual, and tries to determine whether this individual is a member of the training data or an independent sample from the same distribution. A successful membership-inference attack typically requires the attacker to have some knowledge about the distribution that the training data was sampled from, and this knowledge is often captured through a set of independent reference samples from that distribution. In this work we study how much information the attacker needs for membership inference by investigating the sample complexity, i.e., the minimum number of reference samples required, for a successful attack. We study this question in the fundamental setting of Gaussian mean estimation, where the learning algorithm is given n samples from a Gaussian distribution N(μ, Σ) in d dimensions and outputs an estimate μ̂ with error 𝔼[‖μ̂ − μ‖_Σ²] ≤ ρ²d. Our result shows that for membership inference in this setting, Ω(n + n²ρ²) samples can be necessary to carry out any attack that competes with a fully informed attacker. Our result is the first to show that the attacker sometimes needs many more samples than the training algorithm uses to train the model. This result has significant implications for practice, as all attacks used in practice have a restricted form that uses O(n) samples and cannot benefit from ω(n) samples. Thus, these attacks may be underestimating the possibility of membership inference, and better attacks may be possible when information about the distribution is easy to obtain.
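A toy version of the restricted O(n)-sample attack style mentioned above, in the Gaussian mean-estimation setting, can be simulated directly. All parameters here are illustrative, and for brevity the true mean is fixed at 0 so that the reference-sample estimation step can be omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 100, 50, 500

member_scores, nonmember_scores = [], []
for _ in range(trials):
    train = rng.normal(size=(n, d))   # n samples from N(0, I_d)
    mu_hat = train.mean(axis=0)       # the released mean estimate
    member = train[0]                 # an individual in the training set
    nonmember = rng.normal(size=d)    # a fresh sample from the same law
    # Score each target by its correlation with the released estimate:
    # members pull mu_hat toward themselves, so their score is elevated
    # by roughly d/n on average.
    member_scores.append(member @ mu_hat)
    nonmember_scores.append(nonmember @ mu_hat)

print(np.mean(member_scores), np.mean(nonmember_scores))
```

Averaged over trials, members score noticeably higher than non-members, which is the signal such correlation-style attacks exploit; the lower bound above concerns how many reference samples are needed to match a fully informed attacker, which this toy sidesteps by assuming the mean is known.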
I will discuss recent methods we developed to audit the differential privacy of ML models without changing the training pipeline and without additional cost.