Researchers are uncovering how large language models work

LLMs are built using a technique called deep learning, in which a network of billions of neurons, simulated in software and modeled on the structure of the human brain, is exposed to trillions of examples of something to discover inherent patterns. Trained on strings of text, LLMs can hold conversations, generate text in a variety of styles, write software code, translate between languages, and much more.

The models are essentially grown, not designed, says Josh Batson, a researcher at Anthropic, an artificial intelligence startup. Because LLMs aren’t explicitly programmed, no one is entirely sure why they have such extraordinary capabilities, nor why they sometimes misbehave or give incorrect or made-up answers, known as “hallucinations.” LLMs are, in effect, black boxes. This is worrying, given that these and other deep learning systems are beginning to be used for all sorts of things, from offering customer support to preparing document summaries to writing software code.

It would be useful to be able to peer inside an LLM to see what’s going on, in the same way that it’s possible, with the right tools, to do so with a car engine or a microprocessor. Being able to understand the inner workings of a model in forensic bottom-up detail is called “mechanistic interpretability.” But it’s a daunting task for networks with billions of neurons inside. That hasn’t stopped people from trying, including Dr. Batson and his colleagues. In a paper published in May, they explained how they’ve gained new insights into the workings of one of Anthropic’s LLMs.

You might think that individual neurons within an LLM would correspond to specific words. Unfortunately, things are not that simple. Instead, individual words or concepts are associated with the activation of complex patterns of neurons, and individual neurons can be activated by many different words or concepts. This problem was pointed out in an earlier paper by Anthropic researchers, published in 2022. They proposed, and subsequently tested, several alternative solutions, achieving good results on very small language models in 2023 with a so-called “sparse autoencoder.” In their latest results, they have extended this approach to work with Claude 3 Sonnet, a full-size LLM.

A sparse autoencoder is essentially a second, smaller neural network that is trained on the activity of an LLM and looks for distinctive patterns in activity when “sparse” (i.e., very small) groups of its neurons fire together. Once many of these patterns, known as features, have been identified, researchers can determine which words trigger which features. The Anthropic team found individual features that corresponded to specific cities, people, animals, and chemical elements, as well as higher-level concepts like transportation infrastructure, famous tennis players, or the notion of secrecy. They ran this exercise three times, identifying 1m, 4m and, in the latest attempt, 34m features within Sonnet’s LLM.
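
In code, the core of the technique is compact. The sketch below is a toy sparse autoencoder in PyTorch, not Anthropic’s implementation: the dimensions are invented, the feature space is made wider than the activation space (a common choice in published work on these tools), and the training signal is simply reconstruction error plus an L1 penalty that keeps most features switched off at any one time.

```python
# Toy sparse autoencoder sketch; illustrative only, not Anthropic's code.
# Assumptions: PyTorch, made-up dimensions, random data standing in for
# activations recorded from an LLM.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim: int, n_features: int):
        super().__init__()
        # The encoder maps LLM activations into a wide "feature" space.
        self.encoder = nn.Linear(activation_dim, n_features)
        # The decoder reconstructs the original activations from those features.
        self.decoder = nn.Linear(n_features, activation_dim)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly zero
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction error keeps the features faithful to the LLM's activity;
    # the L1 term pushes most features toward zero, so only a sparse handful
    # fire for any given input.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity

# One toy training step.
sae = SparseAutoencoder(activation_dim=512, n_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
batch = torch.randn(64, 512)  # placeholder for real recorded activations
features, reconstruction = sae(batch)
loss = sae_loss(batch, features, reconstruction)
opt.zero_grad()
loss.backward()
opt.step()
```

Once trained, each feature can be inspected by checking which inputs make it fire, which is how features like the ones for cities or tennis players get their labels.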

The result of all this is a kind of mental map of the LLM, showing a small fraction of the concepts it has learned from its training data. Places in the San Francisco Bay Area that are close geographically are also “close” to each other in conceptual space, as are related concepts such as diseases or emotions. “This is exciting because we have a partial, fuzzy conceptual map of what’s going on,” says Dr. Batson. “And that’s the starting point — we can enrich that map and expand from there.”

Concentrate the mind

In addition to seeing parts of the LLM light up, so to speak, in response to specific concepts, it’s also possible to change its behavior by manipulating individual features. Anthropic tested this idea by “activating” (i.e., turning on) a feature associated with the Golden Gate Bridge. The result was a version of Claude that was obsessed with the bridge and mentioned it at every opportunity. When asked how to spend $10, for example, it suggested paying the toll and driving across the bridge; when asked to write a love story, it made up one about a lovesick car that couldn’t wait to drive across it.
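
The article doesn’t spell out what “activating” a feature means mechanically. A common approach in the interpretability literature is to add that feature’s direction, read off from the autoencoder’s decoder, to the model’s internal activations while it generates text. The snippet below is a hypothetical sketch of that idea; the array shapes, the steer() helper and the strength value are illustrative assumptions, not a description of Anthropic’s method.

```python
# Hypothetical feature-steering sketch; names and shapes are assumptions.
import numpy as np

def steer(activations: np.ndarray, feature_direction: np.ndarray,
          strength: float = 10.0) -> np.ndarray:
    """Push hidden states along one feature's direction at every token.

    activations:       (seq_len, d_model) hidden states at some layer
    feature_direction: (d_model,) decoder vector for, say, a
                       Golden Gate Bridge feature
    strength:          how hard to clamp the feature on
    """
    unit = feature_direction / np.linalg.norm(feature_direction)
    return activations + strength * unit

# Toy usage with random stand-ins for real activations and a real feature.
acts = np.random.randn(16, 512)
bridge_direction = np.random.randn(512)
steered = steer(acts, bridge_direction)
```

The higher the strength, the more insistently the concept intrudes on whatever the model was asked to do.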

The Golden Gate stunt may sound silly, but the same principle could be used to discourage the model from talking about specific topics, such as the production of biological weapons. “AI safety is an important goal here,” says Dr. Batson. It can also be applied to behaviors: by tweaking specific features, models could become more or less sycophantic, empathetic or deceptive. Could a feature emerge that corresponds to the tendency to hallucinate? “We didn’t find a smoking gun,” says Dr. Batson. Whether hallucinations have an identifiable mechanism or signature is, he says, a “million-dollar question.” And it’s a question another group of researchers is tackling in a new paper published in Nature.

Sebastian Farquhar and his colleagues at the University of Oxford used a measure called “semantic entropy” to assess whether or not a statement by an LLM is likely to be a hallucination. Their technique is fairly straightforward: basically, the LLM is given the same prompt multiple times, and its responses are grouped by “semantic similarity” (i.e., by their meaning). The researchers’ intuition was that the “entropy” of these responses (in other words, the degree of inconsistency) corresponds to the LLM’s uncertainty, and therefore the likelihood of hallucination. If all of its responses are essentially variations on a theme, they probably aren’t hallucinations (although they may still be incorrect).
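
As a rough illustration of the idea, and not the Oxford team’s code, the sketch below groups sampled answers by meaning and computes the entropy of the resulting clusters. The same_meaning() check is a crude placeholder: the paper instead judges whether two answers mean the same thing by asking a language model whether each entails the other.

```python
# Rough illustration of semantic entropy; not the published implementation.
import math

def same_meaning(a: str, b: str) -> bool:
    # Placeholder: treat answers as equivalent if they match after
    # normalization. The real method uses entailment between answers.
    return a.strip().lower() == b.strip().lower()

def semantic_entropy(answers: list[str]) -> float:
    # Greedily cluster answers by meaning, then take the Shannon entropy
    # of the cluster frequencies. High entropy means the model's answers
    # scatter across meanings, which the researchers read as a sign of
    # confabulation.
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return sum(-p * math.log(p) for p in probs)

# Consistent answers give low entropy; scattered answers give high entropy.
print(semantic_entropy(["Portugal"] * 5))              # 0.0
print(semantic_entropy(["answer A", "answer B",
                        "answer C", "answer A",
                        "answer D"]))                   # ~1.33
```

Agreement in meaning, rather than in exact wording, is what counts as consistency here, which is why the grouping step matters.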

In one example, the Oxford group asked an LLM which country is associated with fado music, and it consistently responded that fado is the national music of Portugal, which is correct and not a hallucination. But when asked about the function of a protein called StarD10, the model gave several very different answers, suggesting a hallucination. (The researchers prefer the term “confabulation,” a subset of hallucinations they define as “arbitrary and incorrect generations.”) Overall, this approach was able to distinguish between accurate statements and hallucinations 79 percent of the time, ten percentage points better than previous methods. This work is complementary, in many ways, to Anthropic’s.

Others have been peering inside LLMs, too: the “superalignment” team at OpenAI, the maker of GPT-4 and ChatGPT, published its own paper on sparse autoencoders in June, though the team has since disbanded after several researchers left the company. But the OpenAI paper contained some innovative ideas, says Dr. Batson. “We’re really glad to see groups everywhere working to better understand models,” he says. “We want everyone to do it.”

© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under license. The original content can be found at www.economist.com
