Large Language Models Take Center Stage in Simons Institute Workshop


In her play Fires in the Mirror, playwright Anna Deavere Smith performed a series of monologues, each embodying someone she interviewed in real life. In one monologue, she channeled MIT physicist Aron Bernstein talking about building mirrors for telescopes. These instruments have flaws, and to overcome them — “if you wanna look up in the heavens and see the stars as well as you can without distortion” — we need mirrors as big as possible.

This idea of building the biggest mirror possible was the metaphor for the recently concluded Simons Institute workshop on Large Language Models and Transformers. Umesh Vazirani, the Institute’s research director for quantum computing, chaired and organized the workshop, and he alluded to Smith and her play when opening the proceedings: “When you are trying to understand something difficult, it’s really helpful to have a very large number of points of view juxtaposed, in a way where there is a lot of discussion,” he said.

The difficult thing everyone is trying to understand is, of course, the technology du jour: large language models (LLMs), such as ChatGPT, PaLM, GPT-4, and the like. Today’s LLMs are massive artificial neural networks, with hundreds of billions of parameters, where the “parameters” are the adjustable strengths of the connections between individual computing units, or artificial neurons. These networks are trained on equally massive amounts of text taken from the internet and elsewhere; the training involves taking a piece of text, hiding the next word, and getting the network to predict the hidden word (technically, LLMs deal in “tokens,” which are not necessarily entire words, but words are a good proxy). When it makes a mistake, the network’s parameters are tweaked ever so slightly, so that when given the same text again, the network does a little better at predicting the hidden word. This is done for every word in every sentence in the training data, until the network’s error rate becomes acceptably low.

Such a trained network can then be given prompts — in the form of some text — and it predicts the next word, appends it to the prompt, predicts the next word, appends it to what came before, and so on. Eventually, the network spews out a large amount of text, triggered entirely by the initial prompt.
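This train-then-generate loop can be seen in miniature with a toy stand-in for the neural network: a bigram model that simply counts which word follows which. (This is a hypothetical sketch for illustration; real LLMs learn billions of parameters by gradient descent rather than by counting.)

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    # Return the most frequent successor seen in training.
    # (An LLM instead outputs a probability over its whole vocabulary.)
    return follows[word].most_common(1)[0][0]

def generate(prompt, n_words):
    # Autoregressive loop: predict a word, append it to what came
    # before, and repeat -- exactly the procedure described above.
    out = prompt.split()
    for _ in range(n_words):
        out.append(predict_next(out[-1]))
    return " ".join(out)

print(generate("the cat", 4))
```

Even this crude counter exhibits the basic behavior: the prompt seeds the loop, and everything after it is the model's own prediction fed back into itself.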

That this approach would yield anything useful was far from obvious. In 2019, when OpenAI released GPT-2 (the acronym stands for “generative pretrained transformer”), it wasn’t particularly impressive. “GPT-2, off the shelf, is completely hopeless,” said Yejin Choi of the University of Washington, during a panel discussion on the workshop’s first day. GPT-2 had 1.5 billion parameters. But its descendant, GPT-3, with 175 billion parameters, changed the landscape.

Suddenly, LLMs are all the rage. ChatGPT, a chatbot built using GPT-3.5 and augmented using reinforcement learning from human feedback (RLHF, a method to align the LLM’s outputs to human values and thus prevent it, for example, from producing racist, sexist, or malicious text), was the first to capture the imaginations of both academics and the public. OpenAI has since released GPT-4. These LLMs can now do surprisingly complex tasks: produce answers to math questions that suggest an ability to reason; write a Star Wars episode in the style of Douglas Adams (as UC Berkeley’s Alexei Efros said during the workshop, GPT can do so and it’s “brilliant, it’s funny, it’s cool, it’s great”); produce code to help programmers become more productive; answer theory-of-mind questions, which ostensibly involves “modeling” the minds of others; and so on. Of course, LLMs also produce egregiously erroneous outputs, in ways that are indicative of their inability to do any real reasoning.

Such errors notwithstanding, LLMs are showing intriguing prowess. “The recent developments around LLMs are a pivotal moment!” said Yasaman Bahri of Google DeepMind, a speaker at the workshop. “Beyond it being an engineering feat, I view it as an empirical discovery in a way, since it was not clear that such a procedure should have succeeded — namely, that a relatively simple set of ingredients, such as next-token prediction and massive amounts of data and computational power, can be used to construct language models that work as well as they do.”

It’s this pivotal moment that the workshop sought to illuminate. “The selection of talks … portrayed well the diversity of expertise that is relevant for research around LLMs,” said Bahri.

There were talks that highlighted empirical observations, including the opening talk by Yin Tat Lee of Microsoft Research on “Sparks of Artificial General Intelligence,” a reference to his team’s exploration of the capabilities of GPT-4, and Choi’s talk on what might be possible or impossible, given our current understanding of LLMs.

Others addressed theoretical concerns. For example, Ilya Sutskever, cofounder and chief scientist of OpenAI, spoke about a theory of unsupervised learning (something that could be applied to LLMs, which are unsupervised or, more precisely, self-supervised learners). Sanjeev Arora, professor of computer science at Princeton University, proposed a theoretical framework to explain so-called emergent phenomena in LLMs (abilities that appear as models are scaled up, either by making them bigger or by using more training data).

Questions about data kept cropping up in talks and discussions. As LLMs get bigger, will they run out of training data? Is the quality of the training data important? Will synthetic data, generated by older LLMs, overcome the possible impending shortage of training data?

Perhaps one of the most intriguing and unexpected consequences of large language models is their impact on our understanding of aspects of human cognition, in particular human language ability. Some talks took this on. Steven Piantadosi of UC Berkeley spoke about how LLMs, which have shown that it’s possible to learn syntax and even some semblance of semantics purely from the statistical properties of text, are causing cognitive scientists to do a double take on their theories of how human grammar and language develop. Lio Wong and Alex Lew, of Josh Tenenbaum’s group at MIT, took on an even more provocative question: What role does language play in the development of intelligence, including human intelligence?

Concerns over misuses of LLMs loomed large. Nicholas Carlini of Google DeepMind presented a principled framework for demonstrating that even models that have been aligned using RLHF can be fooled into generating harmful content. Previously, researchers have shown how to generate so-called adversarial examples for images, such that an image-recognition AI can be fooled into misclassifying an image of, say, a tabby cat as that of guacamole, simply by introducing imperceptible noise into the image. Carlini showed how designing prompts that have the equivalent of such noise can nudge an LLM into generating, say, hate-filled speech and foul language, which it otherwise would refuse to do.
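The image-domain trick that such attacks build on can be demonstrated with a toy linear classifier (a hypothetical sketch, not Carlini's method; real attacks perturb pixels against the gradient of a deep network):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000                  # stand-in for an image's pixel count
w = rng.normal(size=dim)      # weights of a toy linear "classifier"
x = rng.normal(size=dim)      # an input; we make its true label +1
if w @ x < 0:
    x = -x

def predict(v):
    return 1 if w @ v > 0 else -1

# Adversarial perturbation: nudge every coordinate slightly against the
# sign of the corresponding weight. The per-pixel step needed to flip
# the label shrinks as the dimension grows, which is why perturbations
# to real images can be imperceptible to the eye.
margin = w @ x
eps = 1.1 * margin / np.abs(w).sum()   # just past the minimal flipping step
x_adv = x - eps * np.sign(w)
```

Each coordinate of `x_adv` differs from `x` by only `eps`, yet the classifier's verdict flips; Carlini's point is that carefully crafted prompt suffixes can play the same role for an aligned LLM.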

Misuse can be more innocuous, but it’s misuse nonetheless: say, students using LLMs to generate essays they have been assigned to write. To prevent such misuse, Scott Aaronson of the University of Texas at Austin, who is on leave to work at OpenAI, spoke about watermarking — methods of inserting statistical signatures into text generated by LLMs that can be detected later. Of course, these methods might be subverted by adversarial AIs capable of spotting such watermarks. “In the limit, where you have AI on both sides of the problem, I think it’s far from obvious who wins this race,” said Aaronson.
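One family of watermarking schemes from the research literature (a "green list" construction; this sketch is illustrative and not necessarily Aaronson's own scheme) uses a secret key and the previous token to pseudorandomly partition the vocabulary in half, biases generation toward one half, and detects the watermark by counting how often that half was chosen:

```python
import hashlib
import random

VOCAB = [f"w{i:02d}" for i in range(100)]
KEY = "secret-key"  # shared between the generator and the detector

def green_list(prev_token):
    # Pseudorandomly pick a "green" half of the vocabulary, keyed on
    # the secret key and the previous token.
    h = int(hashlib.sha256(f"{KEY}:{prev_token}".encode()).hexdigest(), 16)
    shuffled = sorted(VOCAB)
    random.Random(h).shuffle(shuffled)
    return set(shuffled[: len(VOCAB) // 2])

def generate_watermarked(n_tokens, seed=0):
    # Stand-in for an LLM: sample uniformly, but only from the green
    # list. (Real schemes softly bias the model's own distribution.)
    rng = random.Random(seed)
    tokens = ["w00"]
    for _ in range(n_tokens):
        tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
    return tokens

def green_fraction(tokens):
    # Detector: with the key, count how often the green half was
    # chosen -- about 0.5 for ordinary text, near 1.0 if watermarked.
    hits = sum(1 for a, b in zip(tokens, tokens[1:]) if b in green_list(a))
    return hits / (len(tokens) - 1)
```

Without the key, the bias is statistically invisible; with it, a few hundred tokens suffice to distinguish watermarked from ordinary text.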

One issue that has been center stage during the evolution of LLMs over the past few years concerns their ability to generate truly novel data. Some researchers in the recent past have argued that they cannot. For example, in 2021, Emily Bender of the University of Washington and colleagues used the evocative phrase “stochastic parrots” to describe LLMs, a phrase that implies that LLMs are merely regurgitating — albeit in complex, randomized ways — existing data. In his talk at the workshop, Yin Tat Lee asked: Are LLMs simply copy-and-paste machines on steroids? His talk implied that the answer is a tentative no. Spirited debate about whether LLMs can truly generalize — meaning, extrapolate beyond the bounds of training data — did take place during the workshop, but conclusive answers elude researchers. The workshop participants, however, skewed toward the notion that modern LLMs are doing more than merely exploiting sophisticated statistical correlations.

The use of metaphors to describe technology — say, referring to LLMs as stochastic parrots or otherwise — might prove central to how judges rule on questions about whether LLMs are infringing on someone’s copyrighted data, said Pamela Samuelson, professor of law and information at UC Berkeley, in her talk, “Large Language Models Meet Copyright Law.”

Overall, the workshop seemed to capture the prevalent mood among researchers and industry. “The workshop atmosphere was thick with expectation and excitement,” said Efros, comparing it to what might have been the mood at another epochal moment in scientific history — the development of quantum physics in the early 1900s. “I imagine that a gathering of physicists at the dawn of the 20th century might have felt similar — everyone sensed that something big was coming, but it wasn’t quite clear what.”