Asaf Nachmias is currently a professor of mathematics at Tel Aviv University. His research interests include probability theory, statistical physics, hyperbolic surfaces, and circle packing.
Hariharan Narayanan is currently a faculty member at the School of Technology and Computer Science at TIFR Mumbai. His research interests are geometric and probabilistic aspects of high-dimensional phenomena, including the spectral geometry of random...
Zeev Rudnick (Tel Aviv University) was awarded a PhD from Yale University in 1990, followed by positions at Stanford University and Princeton University, before joining Tel Aviv University in 1995. His main research interests lie broadly in Number Theory (especially...
Standard offline evaluations for language models -- a series of independent, stateless inferences made by a model -- fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, identical benchmark questions posed to the same language model can produce markedly different responses depending on whether they are asked of a stateless system, within one user's chat session, or within a different user's chat session. I'll share empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations in which 800 real users of ChatGPT and Gemini posed benchmark and other provided questions to their own chat interfaces. Overall, this underscores the need to evaluate AI systems in terms of the behaviors they exhibit in human interactions, rather than solely through decontextualized, prediction-only outputs.
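To make the contrast concrete, here is a minimal sketch (not from the talk) of the two evaluation modes. `query_model` is a hypothetical stand-in for whatever chat API is under test; the toy implementation simply conditions on context length, mimicking how accumulated history can shift a real model's behavior:

```python
# Minimal sketch of offline (stateless) vs. field (in-session) evaluation.
# `query_model` is a hypothetical placeholder, not a real API.

def query_model(messages: list[dict]) -> str:
    # Toy stand-in: the reply depends on the accumulated context,
    # mimicking how personalization shifts real model behavior.
    return f"answer conditioned on {len(messages) - 1} prior message(s)"

def stateless_eval(question: str) -> str:
    # Offline-style: the benchmark question arrives with no context.
    return query_model([{"role": "user", "content": question}])

def in_session_eval(question: str, user_history: list[dict]) -> str:
    # Field-style: the same question lands inside a real user's session,
    # so prior history (and hence the user) can change the answer.
    return query_model(user_history + [{"role": "user", "content": question}])

q = "What is the capital of Australia?"
history = [{"role": "user", "content": "I'm planning a Sydney trip."},
           {"role": "assistant", "content": "Great choice!"}]

print(stateless_eval(q))            # no prior messages
print(in_session_eval(q, history))  # two prior messages
```

Running both calls on the same question makes the point of the abstract operational: the inputs the model actually sees differ, so the two evaluation modes need not agree.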
Performance evaluation plays a central role in decisions about whether and how predictive algorithms should be deployed in high-stakes settings. Yet, in many real-world domains, evaluation is fundamentally difficult: the data available for assessment are often biased, incomplete, or noisy, and the act of deploying a model can itself alter which outcomes are observed. As a result, standard evaluation practices may substantially misrepresent both overall model performance and disparities across groups. In this talk, we examine several common threats to valid evaluation—including measurement error, selection bias, and distribution shift—and present principled evaluation methods that enable valid performance assessment under these challenges when appropriate conditions are met.
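As a concrete illustration of one of these threats, the following toy simulation (all variables and the observation model are invented, not from the talk) shows how selection bias distorts a naive accuracy estimate, and how inverse propensity weighting can recover the population value when the selection probabilities are known:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated population: x drives both the outcome and the chance that
# the outcome is ever observed (a selective-labels setting).
x = rng.normal(size=n)
y = (x + rng.normal(size=n) > 0).astype(int)     # true outcome
yhat = (x > 0).astype(int)                       # model prediction
p_obs = 1 / (1 + np.exp(-2 * x))                 # P(outcome observed | x)
observed = rng.random(n) < p_obs

correct = (yhat == y).astype(float)

# Naive estimate uses only observed cases, over-sampling easy ones.
naive = correct[observed].mean()

# Inverse-propensity-weighted (Hajek) estimate reweights observed cases
# by 1 / P(observed), recovering the population accuracy.
w = 1 / p_obs[observed]
ipw = (w * correct[observed]).sum() / w.sum()

print(f"true population accuracy: {correct.mean():.3f}")
print(f"naive estimate:           {naive:.3f}")
print(f"IPW estimate:             {ipw:.3f}")
```

Here observation is more likely exactly where the model is accurate, so the naive estimate is optimistic; the correction assumes the selection probabilities are known or well estimated, which is itself one of the "appropriate conditions" the talk refers to.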
In many settings, interventions may be more effective for some individuals than for others, so that targeting interventions may be beneficial. We analyze the value of targeting in the context of a large-scale field experiment with over 53,000 college students, where the goal was to use “nudges” to encourage students to renew their financial-aid applications before a non-binding deadline. We begin with baseline approaches to targeting. First, we target using a causal forest, assigning treatment to the students estimated to have the highest treatment effects. Next, we evaluate two alternative targeting policies: one targets students with a low predicted probability of renewing financial aid in the absence of the treatment, the other targets those with a high predicted probability. The predicted baseline outcome is not the ideal criterion for targeting, nor is it a priori clear whether to prioritize low, high, or intermediate predicted probability. Nonetheless, targeting on low baseline outcomes is common in practice, for example when treatment effects are difficult to estimate. We propose hybrid approaches that combine the strengths of predictive approaches (accurate estimation) and causal approaches (the correct criterion). We show that targeting intermediate baseline outcomes is most effective in our application, while targeting based on low baseline outcomes is detrimental. In one year of the experiment, nudging all students improved early filing by an average of 6.4 percentage points over a baseline average of 37%, and we estimate that targeting half of the students using our preferred policy attains around 75% of this benefit.
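A stylized simulation may help fix ideas. Everything below is invented: the effect curve is simply assumed to be hump-shaped in the predicted baseline probability, which is enough to reproduce the qualitative ranking reported above (intermediate targeting best, low-baseline targeting worst), though not the paper's actual estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 53_000
budget = n // 2  # treat half of the students, as in the preferred policy

# Invented effect curve: the nudge's effect is hump-shaped in the
# predicted baseline probability of renewing, peaking at intermediate
# values and fading (here, turning slightly negative) at the extremes.
p_base = rng.beta(2, 3, size=n)  # predicted P(renew | no nudge)
effect = 0.13 * np.exp(-((p_base - 0.4) / 0.12) ** 2) - 0.03

def per_treated_pp(targeted: np.ndarray) -> float:
    """Average treatment effect among the targeted, in percentage points."""
    return 100 * effect[targeted].mean()

order = np.argsort(p_base)
low = order[:budget]                    # policy: lowest baseline first
high = order[-budget:]                  # policy: highest baseline first
mid = order[n // 4 : n // 4 + budget]   # policy: intermediate baseline

for name, idx in [("low", low), ("high", high), ("intermediate", mid)]:
    print(f"target {name:12s}: {per_treated_pp(idx):+.2f} pp per treated student")
print(f"nudge everyone    : {100 * effect.mean():+.2f} pp per treated student")
```

The point of the construction is that all three policies rank students by the same predicted baseline, yet only the choice of which region of that ranking to treat determines whether the policy captures or misses the effect heterogeneity.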
The talk compares two legal frameworks—disparate impact (DI) and unfair, deceptive, or abusive acts or practices (UDAP)—as tools for evaluating algorithmic discrimination, focusing on the example of fair lending. While DI has traditionally served as the foundation of fair lending law, recent regulatory efforts have invoked UDAP, a doctrine rooted in consumer protection, as an alternative means to address algorithmic discrimination harms. We formalize and operationalize both doctrines in a simulated lending setting to assess how they evaluate algorithmic disparities. While some regulatory interpretations treat UDAP as operating similarly to DI, we argue it is an independent and analytically distinct framework. In particular, UDAP’s "unfairness" prong introduces elements such as avoidability of harm and proportionality balancing, while its "deceptive" and "abusive" standards may capture forms of algorithmic harm that elude DI analysis. At the same time, translating UDAP into algorithmic settings exposes unresolved ambiguities, underscoring the need for further regulatory guidance if it is to serve as a workable standard.
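For the DI side, a common operationalization is the adverse impact ratio together with the four-fifths rule of thumb; the sketch below applies it to invented simulated lending data (this is a generic formalization, not the talk's specific one). Notably, no comparably standard numeric test exists for UDAP's unfairness, deception, or abusiveness prongs, which speaks to the unresolved ambiguities the talk raises:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Invented simulated lending data: group membership correlates with the
# score that a facially neutral approval rule thresholds on.
group = rng.integers(0, 2, size=n)  # 0 = reference group, 1 = protected group
score = rng.normal(loc=640 - 20 * group, scale=50, size=n)
approved = score >= 620

rate_ref = approved[group == 0].mean()
rate_prot = approved[group == 1].mean()
air = rate_prot / rate_ref  # adverse impact ratio

print(f"approval rate, reference: {rate_ref:.3f}")
print(f"approval rate, protected: {rate_prot:.3f}")
print(f"adverse impact ratio:     {air:.3f}  (four-fifths rule flags < 0.8)")
```

In this toy data the ratio falls below 0.8, so the rule of thumb flags a disparity under DI even though the approval rule never consults group membership directly.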