Metacognition and Related Abilities of Large Language Models
by Anil Ananthaswamy (2024–25 science communicator in residence)
Since they burst into public consciousness about three years ago, large language models (LLMs) have attracted unusual monikers. In 2021, for example, LLMs were called “stochastic parrots” to argue that these machine learning models simply regurgitate highly probable recombinations of linguistic forms that they encounter in the training data, without displaying actual understanding of language.
During her talk this April at the Simons Institute’s workshop on The Future of Language Models and Transformers, Azalia Mirhoseini of Stanford University and Google DeepMind called her team’s project large language monkeys, though not pejoratively. Referring to the idea that even a monkey, given infinite time to tap keys at random on a typewriter, might type out something that Shakespeare wrote, Mirhoseini used the phrase to suggest that even small LLMs might “know” more than is obvious at first and can be made to answer questions correctly given enough compute.
This theme — about LLMs and the knowledge they contain — played out in other talks at the same workshop, with speakers arguing that LLMs not only know, but also know that they know — an ability that can loosely be called metacognition, the dangers of anthropomorphizing machines notwithstanding. The researchers also demonstrated techniques that leverage the metacognition of larger models and the knowledge latent in smaller models to give the latter augmented abilities comparable to those of their bigger brethren; such competent smaller models consume less compute when queried during inference.
When one prompts a small LLM, such as Meta’s Llama-3-8B-Instruct, to solve some coding or math problem, it might get the answer wrong. The LLM has only 8 billion parameters, which are the variables whose values are learned during training; this is considerably fewer than the parameter counts of industry behemoths like OpenAI’s GPT-4o. Mirhoseini’s team showed one way to extract information out of the smaller model, and eventually have it outperform the larger model.
To do so, the researchers first increased the “temperature” of the smaller model, which makes the LLM generate different responses each time, given the same prompt. They then repeatedly prompted the model with the same question to get a slew of answers and used an external checker to pick out the correct answer (this works only if the LLM’s output is amenable to automated testing — for example, a coding task can be checked using a tool that runs unit tests on the code produced by the LLM). The team showed that this method, which requires increasing the amount of compute during the inference phase when the small LLM is repeatedly responding to the same prompt, allowed Llama-3-8B-Instruct to eventually do better than GPT-4o (which was prompted only once).
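The sampling loop described above can be sketched in a few lines. This is a minimal illustration, not the team's actual code: `make_noisy_model` is a stand-in for a real high-temperature LLM call (here it succeeds deterministically once every `period` calls, mimicking a low per-sample success rate), and `checker` stands in for an external verifier such as a unit-test harness.

```python
def best_of_n(prompt, sample, verify, n=100):
    """Repeated sampling: draw up to n samples from the model and
    return the first one the external checker accepts."""
    for _ in range(n):
        candidate = sample(prompt)  # one high-temperature generation
        if verify(prompt, candidate):
            return candidate
    return None  # no sample passed the checker

def make_noisy_model(period=20):
    """Stub 'small LLM' that answers correctly once every `period` calls."""
    calls = {"n": 0}
    def sample(prompt):
        calls["n"] += 1
        return "42" if calls["n"] % period == 0 else "wrong"
    return sample

def checker(prompt, answer):
    """Stand-in for an automated verifier (e.g., running unit tests)."""
    return answer == "42"

model = make_noisy_model(period=20)
print(best_of_n("What is 6 * 7?", model, checker, n=100))  # prints 42
```

The key point the sketch captures is the compute trade-off: a correct answer surfaces only because inference-time compute is multiplied by up to `n`, and only because an external check can recognize the correct answer when it appears.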
“It seems that these small models already know the answers to some of these really hard coding and math problems. It’s just that they don’t output that in the first try. So, we need to repeatedly sample from them,” said Mirhoseini.
A somewhat different take on what LLMs know came from Sanjeev Arora of Princeton University, who described his lab’s efforts at augmenting the behavior of small models by tapping into the metacognitive abilities of bigger models. One such effort involved getting a small model to become better at following user instructions. Normally, a pretrained LLM is good at next-word prediction: given a prompt, it generates the words most likely to follow the prompt. Such an LLM has to be further trained or fine-tuned to become a chatbot capable of following instructions, and this is often done using something called instruction tuning, which requires a large dataset of questions and answers generated and curated by humans — an expensive proposition.
To automate the creation of a small but effective dataset for instruction tuning, Arora’s team first prompted a frontier LLM (such as GPT-4o) to identify, say, a thousand skills that it deemed necessary to follow instructions. GPT-4o produced such a list: an indication of its metacognition, or knowledge about its own abilities. The team then picked pairs of skills at random from the larger list and asked GPT-4o to generate question-answer pairs that required exercising those skills. The team used GPT-4o to produce a dataset of about 4,000 such Q&A pairs, and used this synthetic dataset to fine-tune a small model — in this case the base Llama-3-8B model — to follow instructions. The fine-tuned Llama-3-8B outperformed Claude 3 Opus and Llama-3.1-405B-Instruct, both significantly bigger models. “This was way, way better than anything anybody had ever achieved, starting from a base model,” said Arora.
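The three-step pipeline above — enumerate skills, sample skill pairs, generate Q&A pairs — can be sketched as follows. This is an illustrative outline under stated assumptions, not the lab's released code: `frontier_llm` is a placeholder for a real API call to a model such as GPT-4o, and `stub_frontier` merely simulates its responses.

```python
import random

def build_instruction_dataset(frontier_llm, n_skills, n_pairs, seed=0):
    """(1) Ask a frontier model to enumerate skills it deems necessary
    for instruction following; (2) draw random pairs of those skills;
    (3) ask the model to write a Q&A pair exercising both skills."""
    rng = random.Random(seed)
    skills = frontier_llm(f"List {n_skills} skills needed to follow instructions.")
    dataset = []
    for _ in range(n_pairs):
        a, b = rng.sample(skills, 2)  # a random pair of skills
        dataset.append(frontier_llm(
            f"Write a question and its answer that require both skills: {a}, {b}."))
    return dataset

# Hypothetical stand-in for a real frontier-model API call.
def stub_frontier(prompt):
    if prompt.startswith("List"):
        return [f"skill_{i}" for i in range(10)]
    return {"instruction_prompt": prompt}

data = build_instruction_dataset(stub_frontier, n_skills=10, n_pairs=5)
print(len(data))  # prints 5
```

Pairing skills at random is what keeps the synthetic dataset diverse: with a thousand skills, there are roughly half a million possible pairs, so even a few thousand generated Q&A examples cover combinations no human curator would have listed by hand.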
If modern LLMs are said to be reasoning or thinking, and have thoughts about their thoughts, can the study of these abilities be formalized? That’s exactly what Siva Reddy, of McGill University and Mila - Quebec Artificial Intelligence Institute, and his team are proposing.
One way to get an LLM to reason is to give it an example of the chain of thought that a human might use to answer some question. The LLM can then use this as a template to generate its own chain of thought to answer a related question. This was the state of the art in the middle of 2024. Then came the newer generation of LLMs, called large reasoning models (LRMs), which are trained to generate numerous chains of thought on their own, in response to a prompt, and select the best one in order to answer a query. While researchers are not privy to the internal ruminations of frontier LRMs released by companies like OpenAI and Anthropic, they do have access to the chains of thought produced by DeepSeek-R1, an open-source LRM released in January by the Chinese company DeepSeek. “If you actually look into these thoughts, they contain a consistent pattern,” said Reddy.
His team analyzed DeepSeek-R1’s “thoughts” and found four discernible phases. In the first phase, the LRM defined for itself the problem posed by a user; this could be as simple as restating the user’s input. Next, the LRM decomposed the problem and arrived at a first solution. “You can think of this as some initial chain of thought,” said Reddy. After finding a preliminary solution, the LRM engaged in “reconstruction,” a phase in which it reconsidered its initial assumptions and generated new text starting with words such as “Wait,” “Alternatively,” “What if,” and so on. “This is where the model spends most of its time,” said Reddy. “It keeps cycling and cycling and cycling and gets to the answer finally.” The final phase produced the answer.
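One simple way to quantify the “reconstruction” phase is to scan a reasoning trace for the marker words it tends to open with. The sketch below is an assumption-laden toy, not the team's methodology: the marker list and the sentence-splitting heuristic are illustrative choices.

```python
# Illustrative markers of reconsideration; the actual set used in any
# published analysis may differ.
RECONSIDER_MARKERS = ("Wait", "Alternatively", "What if", "Hmm")

def count_reconsiderations(thought: str) -> int:
    """Count sentences in a reasoning trace that open with a
    reconsideration marker — a rough proxy for how often the model
    cycles back over its own assumptions."""
    sentences = [s.strip() for s in thought.split(".") if s.strip()]
    return sum(1 for s in sentences if s.startswith(RECONSIDER_MARKERS))

trace = ("The user asks for the sum. First add 2 and 3 to get 5. "
         "Wait, the question said product. Alternatively, multiply: 2 * 3 = 6. "
         "The answer is 6.")
print(count_reconsiderations(trace))  # prints 2
```

Even this crude counter illustrates Reddy's observation in miniature: in long traces, the marker-prefixed sentences are where most of the tokens — and hence most of the inference compute — accumulate.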
Reddy’s team even has a name for this field of study: “thoughtology” — the systematic study of reasoning chains or thoughts.