Winds of Change: LLMs Become More Thoughtful
by Anil Ananthaswamy (science communicator in residence)
The New York Public Library sells a magnet printed with words by the American author Fran Lebowitz: “Think before you speak. Read before you think.” OpenAI’s latest offering — the o1 suite of large language models (LLMs) — seems to be taking this appeal to heart. The models, according to OpenAI, are “designed to spend more time thinking before they respond.” That extra deliberation makes them more effective at solving complex problems that require reasoning.
Two recent talks at the Simons Institute highlighted this emerging trend. The first was a talk by OpenAI’s Noam Brown — Learning to Reason with LLMs — during the workshop on Transformers as a Computational Model, organized as part of the Special Year on Large Language Models and Transformers research program. The second was Sasha Rush’s Richard M. Karp Distinguished Lecture on Speculations on Test-Time Scaling.
Large language models are trained to predict the next word (or token, to be precise, but word is a good proxy). The training involves taking, say, a sentence from a corpus of text, masking the last word, and asking the LLM to predict the missing word. Initially, the LLM will get it wrong, but as its parameters are tweaked, the model learns the dependencies between the missing word and the other words in the sentence, and eventually correctly predicts the missing word. This is done for all sentences in the training data. Once the LLM is trained, its parameters become a storehouse for the statistical structure of, and the information contained in, the written text used for training.
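To make that concrete, here is a toy sketch in Python. It is nothing like a real transformer, which learns billions of parameters by gradient descent, but it shows in miniature what it means for “training” to capture the statistical structure of text well enough to predict the next word; the tiny corpus and the bigram-counting scheme are invented purely for illustration.

```python
from collections import defaultdict, Counter

# A drastically simplified stand-in for next-word prediction.
# Real LLMs are transformers whose parameters are adjusted by gradient
# descent; here the "parameters" are just bigram counts, but the goal is
# the same: learn which word tends to follow which context.

corpus = [
    "think before you speak",
    "read before you think",
]

counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1  # "training": accumulate next-word statistics

def predict_next(word):
    """Return the next word most often seen after `word` during training."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

print(predict_next("before"))  # -> "you"
print(predict_next("you"))     # "speak" and "think" are equally frequent; ties go to the word seen first
```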
Using such an LLM — a process called inference — involves prompting the trained model with some text. The LLM predicts the most likely next word to follow the prompt, appends it to the original text, predicts the next word, and keeps going until it generates an end-of-text token or hits some predetermined limit.
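The inference loop itself is simple enough to sketch. In the toy code below, next_word is a stand-in for a trained model's next-word prediction; the loop around it mirrors the procedure just described: predict, append, and repeat until an end-of-text token appears or a length limit is reached. The vocabulary and transitions are invented for illustration.

```python
import random

# A minimal sketch of the autoregressive inference loop. next_word()
# stands in for a trained LLM's next-word prediction; the rest mirrors
# the real procedure: predict, append, repeat.

VOCAB_AFTER = {
    "the": ["model", "prompt"],
    "model": ["predicts", "thinks"],
    "predicts": ["the", "<eot>"],
    "thinks": ["<eot>"],
    "prompt": ["<eot>"],
}

def next_word(context, rng):
    """Stand-in for the LLM: pick a plausible continuation of the last word."""
    options = VOCAB_AFTER.get(context[-1], ["<eot>"])
    return rng.choice(options)

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(max_tokens):        # predetermined length limit
        token = next_word(tokens, rng)
        if token == "<eot>":           # end-of-text token: stop generating
            break
        tokens.append(token)           # append the prediction and keep going
    return " ".join(tokens)

print(generate("the"))  # e.g. "the model predicts the prompt"
```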
One way to get an LLM to answer a query correctly is to include in the prompt a worked example of a similar problem, with its solution spelled out, followed by similar but unsolved problems. Alternatively, you can give the LLM the problem along with the instruction: solve it step by step. Both methods are collectively referred to as chain-of-thought (CoT) prompting. Traditionally, a human user provides the CoT prompt.
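For readers who have not seen chain-of-thought prompting in practice, the two variants might look roughly like the sketch below. The wording and the arithmetic problems are invented for illustration, not drawn from any particular benchmark, and the llm.complete call in the final comment is a hypothetical client named only to show where the prompt would go.

```python
# Illustrative prompts only; the wording and the arithmetic problems are
# made up for this example, not taken from any particular benchmark.

# Few-shot CoT: the prompt includes a worked example whose solution
# spells out its intermediate steps, followed by a new, unsolved problem.
few_shot_cot = """\
Q: Ali has 3 boxes with 4 pens each. How many pens does he have?
A: Each box has 4 pens, and there are 3 boxes, so 3 * 4 = 12. The answer is 12.

Q: Mia has 5 bags with 6 marbles each. How many marbles does she have?
A:"""

# Zero-shot CoT: no worked example; the prompt simply instructs the
# model to reason step by step before answering.
zero_shot_cot = """\
Q: Mia has 5 bags with 6 marbles each. How many marbles does she have?
A: Let's solve this step by step."""

# Either string would be sent to the model as the prompt, e.g.:
# response = llm.complete(few_shot_cot)   # hypothetical client call
```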
In his talk, Brown told the audience that o1 models have been trained to generate and evaluate many candidate chains of thought for a given user query and to select the best one. This takes time and constitutes “thinking” before answering — and it considerably increases the compute required during inference. The more inference-time (also called test-time) compute the model uses, the better its performance, as it homes in on the best chain of thought for answering the query. “This chain of thought ends up being longer. It ends up being a higher quality than what’s attainable via prompting alone. And it contains a lot of behaviors that you’d want to see from a reasoning model,” said Brown. “For example, it can do error correction — it can recognize when it’s made a mistake and fix it. It can try multiple different strategies, if it notices that one isn’t working. And it can break down a difficult problem into smaller steps to tackle it in a more systematic way.”
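OpenAI has not published how o1 implements this, so the sketch below should be read only as a generic illustration of spending extra compute at inference time: sample several candidate chains of thought and keep the answer they most often agree on, a selection strategy known in the research literature as self-consistency. The sample_chain_of_thought function is a mock that stands in for repeated calls to a real model, and all the numbers are invented.

```python
import random
from collections import Counter

# Generic illustration of test-time compute, not OpenAI's actual method:
# sample many chains of thought and keep the most common final answer
# (a "self-consistency" style selection). sample_chain_of_thought() is a
# mock that stands in for calls to a real model.

def sample_chain_of_thought(question, rng):
    """Mock LLM call: returns (reasoning, final_answer) with some noise."""
    correct = 30  # pretend the true answer to `question` is 30
    answer = correct if rng.random() < 0.7 else correct + rng.choice([-1, 1])
    reasoning = f"Step-by-step reasoning leading to {answer}."
    return reasoning, answer

def answer_with_test_time_compute(question, num_samples=16, seed=0):
    rng = random.Random(seed)
    answers = [sample_chain_of_thought(question, rng)[1] for _ in range(num_samples)]
    # More samples -> more compute at inference time, but a better chance
    # that the majority answer is the right one.
    return Counter(answers).most_common(1)[0][0]

print(answer_with_test_time_compute("Mia has 5 bags with 6 marbles each...", num_samples=16))
```

The majority vote here is just one possible selector; other work on test-time compute scores candidates with a separate verifier or reward model instead. Either way, the cost of inference grows with the number of chains the model considers, which is the trade-off Brown described.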
While Brown did not explicate the inner workings of the o1 models in more detail, Rush speculated in his talk about how exactly such models might work and discussed their strengths and weaknesses. He identified a set of new research questions opened up by models that use more compute during inference than has been the norm. “We’re not just talking about it being faster to save more money,” said Rush. “But we are talking about it being faster to actually reach new orders of magnitude in terms of reasoning. I think that’s a really interesting area particularly for inference-time compute.”