
Abstract
There has been a lot of recent “buzz” and hype around scaling up test-time compute and how it provides a new dimension for improving the reasoning and performance of LLMs. In this talk, I will share my perspective on the broad question of “what it means to optimize test-time compute and how we could go about it”.
First, I will formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute through the lens of exploration and exploitation. This perspective becomes increasingly relevant as we scale test-time token budgets to very large values. It also motivates measuring the efficacy of test-time compute with cumulative regret, by viewing the model's output as a long stream consisting of several episodes. Then, I will show that this formulation allows us to devise a finetuning paradigm that specifically optimizes the intermediate tokens in a test-time output stream with dense rewards based on how useful each step is (its information gain), and show that this is crucial for enabling the discovery of novel solutions on hard problems, as I will argue is the case in practice. To contrast with our approach and validate our analysis, I will present a simple ablation study on some state-of-the-art models (e.g., R1) that attempts to understand their behavior and outlines some ways of improving it with information gain.
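To make the regret view concrete, here is a minimal sketch in my own notation (the symbols $z_j$, $\mu_j$, $J$, and $r_{\mathrm{prog}}$ below are illustrative and not necessarily the paper's): if the output stream for a prompt $x$ is segmented into episodes $z_1, \dots, z_k$, and $\mu_j$ denotes the policy induced by conditioning the model on $x$ and the first $j$ episodes, then one natural regret and dense-reward pair is

$$\mathrm{Regret}_k \;=\; \sum_{j=1}^{k} \big[\, J(\pi^\ast) - J(\mu_j) \,\big], \qquad r_{\mathrm{prog}}(z_j) \;=\; \Pr[\text{correct} \mid x, z_{1:j}] \;-\; \Pr[\text{correct} \mid x, z_{1:j-1}],$$

so minimizing cumulative regret amounts to rewarding each intermediate episode in proportion to the progress (information gain) it contributes toward solving the problem.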
In the second part of the talk, I will turn to a more formal understanding and present some of our recent theoretical results proving that RL with some form of verification is critical for effectively scaling test-time compute. We show that even if we were to train on expert search traces (e.g., via STaR or stream of search), the suboptimality of expert cloning decays at a much slower rate in the amount of data and the test-time token budget than that of running RL, as long as the base pre-trained LLM admits a heterogeneous distribution that also puts sufficient mass on trajectories attaining high rewards. I will discuss some implications of this result. Overall, this work solidifies the belief that RL is needed to optimize test-time compute and, coupled with the first part, presents a new way to think about this topic.
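One illustrative way to read the coverage condition above (again in my own notation, not necessarily the paper's formal assumption): the result applies when the base model $\pi_{\mathrm{base}}$ places non-negligible probability on near-optimal traces, e.g.,

$$\Pr_{\tau \sim \pi_{\mathrm{base}}}\big[\, r(\tau) \ge r^\ast - \epsilon \,\big] \;\ge\; p \;>\; 0,$$

so that RL with verification can find and amplify these rare high-reward trajectories, whereas cloning heterogeneous expert traces improves much more slowly, as stated above.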
This talk is based on https://blog.ml.cmu.edu/2025/01/08/optimizing-llm-test-time-compute-inv…, a paper extending this blog post, and another paper on the theory of verification vs. generation.