Happy New Year from Berkeley, where the magnolias are already in bud and we have just welcomed the participants in our Spring 2026 research program on Federated and Collaborative Learning. In addition to the periodic workshops associated with the program, we also have upcoming workshops on various aspects of the nexus of theoretical computer science and machine learning, ranging from the deployment of ML models in social systems and healthcare, through deep learning theory, to the impact of techniques developed in learning theory on the theory of computing.
Predictive risk models in child welfare are often defended as forecasting devices. In the field, they behave like policy. They reorganize triage, supervision, service allocation, and documentation, and they thereby reshape the administrative record that later stands in for “ground truth.” This talk starts from that pragmatic observation and asks a simple but consequential question: what do risk models actually learn when “risk” is produced through street-level bureaucracy, resource constraints, and discretionary judgment rather than revealed as a stable label?
I synthesize a multi-method research program grounded in practice, informed by work with child welfare agencies in Wisconsin and carried forward through collaborations in Ontario. The program spans a systematic review of deployed child welfare algorithms and their targets, ethnographic study of mandated tools embedded in organizational routines, and computational analyses of case narratives that surface invisible labor, shifting needs, and institutional power over time. Across studies, the same pattern recurs: predictors and proxy outcomes frequently encode agency response and surveillance intensity, and the semantics of “risk” drift across the life of a case in ways that standard prediction setups do not represent.
I then extend this argument beyond predictive risk models to contemporary LLM-based workflows. In recent work, we use a local LLM together with practitioner labeling to identify service-plan goal relevance in case notes, and we trace thematic trajectories over time. The result is both diagnostic and cautionary: as cases become longer and more complex, LLM judgments become less reliable, and the notes increasingly document emergent concerns that sit outside formal service plans. I close by reframing evaluation as an intervention problem and by offering a practical audit and design orientation for aligning targets, data, and governance triggers with accountable service pathways rather than standalone risk scores.
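For intuition, here is a minimal sketch of the kind of per-note relevance labeling the abstract describes, not the authors' actual pipeline: the model choice, prompt wording, and YES/NO parsing rule below are all illustrative assumptions.

```python
# A minimal sketch (not the authors' pipeline): ask a locally hosted
# instruction-tuned model whether a case note is relevant to a
# service-plan goal. Model choice, prompt wording, and the YES/NO
# parsing rule are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

def goal_relevant(note: str, goal: str) -> bool:
    prompt = (
        "You label child-welfare case notes.\n"
        f"Service-plan goal: {goal}\n"
        f"Case note: {note}\n"
        "Is this note relevant to the goal? Answer YES or NO.\nAnswer:"
    )
    out = generator(prompt, max_new_tokens=3, do_sample=False)
    completion = out[0]["generated_text"][len(prompt):]  # strip the echoed prompt
    return "YES" in completion.upper()

print(goal_relevant("Mother attended her second parenting class.",
                    "Complete a certified parenting program."))
```

Scoring such judgments against practitioner labels, stratified by case length, is one way to surface the reliability decay the talk describes.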
Judgments from humans and one or more artificial agents are increasingly combined for decisions, with the expectation that the decisions of the team will outperform those of individual agents. However, simple approaches to providing human decision-makers with AI support, and to evaluating the resulting team, can produce apparent failures in team performance even when the agents are thought to possess complementary information. I’ll discuss measurement frameworks we’ve developed that apply statistical decision theory and information economics to address questions at the human-agent interface, including how to evaluate when a decision-maker appropriately relies on model predictions, when a human or AI agent could better exploit available contextual information, and how to evaluate (and design) explanations in principled ways.
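A toy sketch in the spirit of, though not taken from, these frameworks: with independently erring agents there is genuine complementary information, yet a team that defers at random captures little of it, which is visible by comparing the observed team against an idealized combiner.

```python
# Illustrative reliance check on toy data (not the speaker's framework):
# compare the observed team's accuracy against each agent alone and
# against an oracle that follows whichever agent is correct.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
truth = rng.integers(0, 2, n)

# Toy agents: each is right on an independent 75% of cases.
human = np.where(rng.random(n) < 0.75, truth, 1 - truth)
ai = np.where(rng.random(n) < 0.75, truth, 1 - truth)

# Toy team behavior: the human defers to the AI on half the cases at random.
defer = rng.random(n) < 0.5
team = np.where(defer, ai, human)

# Idealized combiner: correct whenever at least one agent is correct.
oracle = np.where((ai == truth) | (human == truth), truth, 1 - truth)

for name, pred in [("human alone", human), ("AI alone", ai),
                   ("observed team", team), ("complementarity ceiling", oracle)]:
    print(f"{name:24s} accuracy: {(pred == truth).mean():.3f}")
```

The gap between the observed team and the ceiling is one informal way to see the quantity such decision-theoretic frameworks formalize: how much available information the team leaves unexploited.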
We study a dynamic model of interactions between a firm and job applicants to identify mechanisms that drive long-term discrimination. In each round, the firm decides which applicants to hire, where the firm's ability to evaluate applicants is imperfect. Each applicant belongs to a group, and central to our model is the idea that the firm becomes better at evaluating applicants from groups from which it has hired in the recent past. We establish that the firm's initial evaluation ability for each group is a critical factor in determining long-term outcomes. We show that there is a threshold such that if the firm's initial evaluation ability for a group is below it, the group's hiring rate decreases over time and eventually goes to zero, whereas if the group starts above the threshold, the hiring rate stabilizes at a positive constant. Therefore, even when two groups are identical in size and underlying skill distribution, a marginal difference in the firm's initial evaluation ability can lead to persistent disparities that worsen over time through a feedback loop. Importantly, the dynamic nature of our model allows us to assess the long-term impact of interventions, specifically, whether an improvement is sustained even after the intervention is lifted. In this light, we show that drastic short-term interventions are more effective than milder long-term interventions. Additionally, we show that smaller groups face inherent disadvantages: they require a higher initial evaluation ability to achieve a favorable long-term hiring outcome and experience lower hiring rates even when they do.
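A stylized simulation conveys the threshold dynamic; the functional forms below are illustrative choices, not the paper's model. Here the firm evaluates an applicant correctly with probability equal to its group-specific ability, hires only on a successful evaluation of a qualified applicant, and its ability drifts up after a hire and down otherwise.

```python
# Stylized feedback loop (illustrative functional forms, not the paper's
# exact model): ability rises with each hire from a group and decays
# otherwise, so low initial ability collapses and high ability stabilizes.
import random

random.seed(0)

def long_run_hiring_rate(a0, rounds=4000, skill_ok=0.6, gain=0.05, decay=0.02):
    ability, recent_hires = a0, 0
    for t in range(rounds):
        applicant_good = random.random() < skill_ok   # qualified applicant
        evaluated = random.random() < ability         # evaluation succeeds
        hired = evaluated and applicant_good
        ability = min(1.0, ability + gain) if hired else max(0.0, ability - decay)
        if t >= rounds - 500:                         # rate over the last 500 rounds
            recent_hires += hired
    return recent_hires / 500

for a0 in (0.2, 0.3, 0.7, 0.9):
    print(f"initial ability {a0:.1f} -> late hiring rate {long_run_hiring_rate(a0):.2f}")
```

In this toy parameterization the expected ability drift is zero where the hire probability a·skill_ok equals decay/(gain+decay), i.e. near a ≈ 0.48: starting below, the hiring rate drifts to zero; starting above, it drifts toward the qualified-applicant base rate.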
When machine learning systems bridge from prediction to intervention—such as in statistical profiling of job seekers—seemingly minor modeling decisions can have profound consequences for who ultimately receives support. This talk examines how different choices in the data science pipeline affect not just predictive accuracy, but the actual composition of individuals flagged for intervention.
Using German administrative labor market data, I present a comparative analysis of regression and machine-learning approaches for predicting long-term unemployment risk. While our models achieve comparable performance (ROC-AUC: 0.70-0.77), they show striking disagreement in which individuals are classified as high-risk, with Jaccard similarities as low as 0.45 between equally accurate models. These differences cascade through the intervention pipeline: classification thresholds, feature importance patterns, and model architectures each reshape the demographic and socioeconomic profile of those targeted for support.
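The disagreement is easy to reproduce in miniature. Below is a sketch on synthetic data, not the German administrative records, and the two model families are illustrative stand-ins: two classifiers with similar ROC-AUC can still flag noticeably different top-risk sets.

```python
# A minimal sketch (synthetic data, not the administrative records):
# two models with comparable ROC-AUC can flag substantially different
# high-risk sets, measured here by Jaccard similarity of the top 10%.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Noisy synthetic task so that neither model is near-perfect.
X, y = make_classification(n_samples=5000, n_features=20,
                           flip_y=0.2, class_sep=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for name, model in [("logit", LogisticRegression(max_iter=1000)),
                    ("gbm", GradientBoostingClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    scores[name] = p
    print(name, "ROC-AUC:", round(roc_auc_score(y_te, p), 3))

k = int(0.10 * len(y_te))  # flag the top 10% of risk scores as "high risk"
flagged = {n: set(np.argsort(-p)[:k]) for n, p in scores.items()}
inter = flagged["logit"] & flagged["gbm"]
union = flagged["logit"] | flagged["gbm"]
print("Jaccard similarity of flagged sets:", round(len(inter) / len(union), 2))
```

The exact overlap depends on the data, threshold, and seed, but the qualitative point survives: equal accuracy does not imply equal flagged populations.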
This work highlights a critical challenge at the prediction-intervention interface: the data we use to train accurate prediction models may be sufficient for forecasting outcomes, but the choices we make in constructing those models introduce new forms of variation that directly impact intervention allocation. I discuss implications for documentation, transparency, and fairness in algorithmic decision-making systems, emphasizing that "letting the data speak" still requires researchers to make consequential choices about which predictive voice to amplify. The talk concludes with reflections on how the prediction-to-intervention pipeline demands richer evaluation frameworks that account for both accuracy and equity in resource allocation.