Abstract
As AI tools become deeply embedded in daily life, analyzing how people interact with them yields critical insights. Conversational logs can help improve AI capabilities and safety, illuminate the future of work, track shifting societal behaviors, and measure the economic implications of AI adoption. However, this value comes at a steep cost: directly analyzing chat logs risks exposing highly sensitive user information. Anthropic's CLIO represented an important first step in tackling this problem, establishing thoughtful (but heuristic) privacy protections for user data. Building on this foundation, a critical challenge remains: how can we accurately map the landscape of AI usage with formal mathematical guarantees that no individual user's information leaks?
We first present "Urania", a framework for generating insights from AI usage logs while satisfying end-to-end differential privacy (DP). We will explore how Urania maps individual conversation records to embedding vectors and partitions them using differentially private clustering. From there, it applies partition selection to aggregated keyword histograms and turns the selected keywords into coherent cluster descriptions. We will also discuss the metrics and criteria needed to evaluate the quality of these private insights.
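To make the partition-selection step concrete, here is a minimal sketch of one textbook approach: Laplace noise on keyword counts followed by thresholding. It is an illustration under stated assumptions, not necessarily the mechanism Urania uses. The function name `dp_keyword_selection`, the parameter `max_per_record`, and the threshold formula 1 + (k/ε)·ln(k/(2δ)) are choices made for this example, and it assumes each conversation record contributes at most `max_per_record` distinct keywords.

```python
import math
import random
from collections import Counter


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_keyword_selection(keyword_sets, epsilon, delta, max_per_record, rng=None):
    """Release noisy counts only for keywords that clear a threshold.

    Assumption (illustrative): one record per user, each capped at
    `max_per_record` distinct keywords, so the histogram has L1
    sensitivity `max_per_record`.
    """
    rng = rng or random.Random()
    counts = Counter()
    for keywords in keyword_sets:
        # Deduplicate and cap each record's contribution.
        capped = sorted(set(keywords))[:max_per_record]
        counts.update(capped)

    scale = max_per_record / epsilon
    # A keyword held by a single record should survive with total
    # probability at most delta across that record's keywords.
    threshold = 1.0 + scale * math.log(max_per_record / (2.0 * delta))

    released = {}
    for keyword in sorted(counts):  # fixed order, independent of input order
        noisy = counts[keyword] + laplace(scale, rng)
        if noisy > threshold:
            released[keyword] = noisy
    return released


if __name__ == "__main__":
    records = [["python", "debugging"]] * 5000 + [["secret-project-name"]]
    out = dp_keyword_selection(records, epsilon=1.0, delta=1e-6,
                               max_per_record=2, rng=random.Random(0))
    print(sorted(out))
```

Keywords shared by many records clear the threshold easily, while a keyword that appears in only one record is suppressed with overwhelming probability. That suppression is what keeps rare, potentially identifying terms out of the released cluster descriptions. A production implementation would also need a floating-point-safe noise sampler, which this sketch omits.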
Finally, we will preview "Calliope", an upcoming method that reimagines this pipeline. Calliope bypasses the often "alien" geometric representations of embeddings by using LLMs directly for hierarchical clustering. By operating natively in the text domain, Calliope overcomes the "utility cliffs" traditionally associated with private geometric clustering, pointing toward a next generation of high-utility, privacy-preserving AI analytics. We will close by discussing key open challenges.
Based on https://arxiv.org/abs/2506.04681, joint work with several amazing collaborators!