Decoding the Mind of a Powerful AI

Large language models develop their problem‐solving capabilities through extensive training on vast datasets rather than through direct human coding. In this process, the models acquire unique strategies—encoded in billions of computations per word—that remain largely hidden even from their developers. Understanding these internal workings not only illuminates the models’ impressive language abilities but also helps ensure they behave as intended.

By uncovering how a model “thinks,” researchers hope to answer questions such as whether the model internally uses a universal language when processing multiple languages, if it plans several words ahead despite generating text one word at a time, and whether its step-by-step explanations accurately reflect its true reasoning process or are sometimes fabricated to sound plausible.

Inspired by techniques from neuroscience that analyze the inner workings of the brain, scientists are developing what can be thought of as an AI microscope. This tool allows them to identify patterns of activity and track the flow of information within a model. Directly probing a model’s internal computations provides new insights that are unattainable by merely studying its output.

Two recent research studies highlight progress in this area. One study builds on earlier work by locating interpretable features inside a model and connecting them into computational circuits that reveal how the model transforms input words into output. The other study delves into the internal processes of a modern language model through detailed examinations of simple tasks, uncovering evidence that:

  • The model sometimes operates in a conceptual space shared across languages, supporting the idea of a universal “language of thought.”
  • It plans its responses far in advance—even in creative tasks like composing poetry—by initially considering multiple possibilities before committing to a final output.
  • On occasion, it constructs convincing yet unfaithful chains of reasoning that may be more geared toward agreeing with the user than reflecting its actual thought process.

Unexpected discoveries have emerged from these studies. For example, the poetry experiment was designed to show that the model does not plan ahead, yet it revealed the opposite: clear evidence of advance planning. In the study of hallucinations, the model’s default behavior turned out to be declining to speculate, with answers produced only when a competing “known entity” feature inhibits that default. And when faced with potentially dangerous requests, the model recognizes the problem well before it manages to steer its response back toward a refusal.

These findings are not only scientifically profound; they represent a significant step toward making AI systems more reliable and aligned with human values. Similar interpretability techniques are already being applied in other domains such as medical imaging and genomics, where understanding the internal mechanisms of models can lead to new scientific insights.

A closer look at the model’s internal “biology” reveals fascinating behavior in several key areas. Researchers have explored the multilingual capacity of the model by investigating whether it contains a cross-lingual core or operates separate language systems. Experiments that prompt the model for translations of concepts like “the opposite of small” show that similar internal features trigger consistent outputs across languages. Notably, the proportion of shared features increases with model scale, suggesting a universal abstract space where meaning is processed before it is rendered in a specific language.

Diagram showing shared features across English, French, and Chinese.
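As a purely illustrative way to picture this kind of measurement, the sketch below compares hypothetical sets of active features for the same prompt posed in three languages and reports how much they overlap. The feature names are invented for illustration; the real experiments work with features extracted from the model’s activations, not hand-written labels.

```python
# Toy illustration: measuring cross-lingual feature overlap.
# The feature IDs below are invented; in the actual experiments,
# features come from the model's internal activations.

def jaccard(a: set, b: set) -> float:
    """Fraction of features shared between two sets of active features."""
    return len(a & b) / len(a | b)

# Hypothetical features that fire on "the opposite of small" asked in
# three languages: a shared conceptual core plus language-specific output features.
active_features = {
    "en": {"antonym", "smallness", "largeness", "english_output"},
    "fr": {"antonym", "smallness", "largeness", "french_output"},
    "zh": {"antonym", "smallness", "largeness", "chinese_output"},
}

for lang_a, lang_b in [("en", "fr"), ("en", "zh"), ("fr", "zh")]:
    overlap = jaccard(active_features[lang_a], active_features[lang_b])
    print(f"{lang_a} vs {lang_b}: {overlap:.0%} of active features shared")
```

In this toy, three of the five features active in any pair of languages are shared; in the research, it is the growth of that shared fraction with model scale that suggests a common conceptual space.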

Another area of inquiry is the model’s ability to compose rhyming poetry. Researchers initially hypothesized that the model would generate text word by word and only adjust at the end of a line to meet rhyming constraints. Instead, the evidence shows that the model plans its rhymes in advance. Experiments demonstrate that when the model is prompted to write a two-line poem, it forms a target word well before completing the line and can even adapt its plan when the intended rhyme is suppressed or replaced by another concept.

Diagram showing how the model plans a rhyme in a two-line poem.
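A loose analogy for the difference between improvising and planning: commit to the line-ending word before writing the line, and see what happens when that word is taken away. The sketch below is only an analogy with illustrative rhyme words; the real experiments suppress or replace internal features rather than entries in a word list.

```python
# Toy analogy for planning a rhyme: choose the line-ending word first,
# then build the line toward it. "Suppressing" the planned word makes
# the plan fall back to another candidate instead of breaking the rhyme.

RHYME_CANDIDATES = ["rabbit", "habit"]  # illustrative endings that rhyme with "grab it"

def compose_second_line(suppressed=frozenset()):
    candidates = [w for w in RHYME_CANDIDATES if w not in suppressed]
    planned_ending = candidates[0]   # the plan is made before the line is written
    return f"His hunger was like a starving {planned_ending}"

print(compose_second_line())                       # ...a starving rabbit
print(compose_second_line(suppressed={"rabbit"}))  # ...a starving habit
```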

The model’s competencies extend to mathematical reasoning as well. Although not designed as a calculator, it can perform operations such as addition with impressive accuracy. Rather than relying solely on memorized addition tables or the standard carry-the-one algorithm, the model employs multiple parallel computational pathways: one produces a rough approximation of the sum while another homes in on the precise final digit, and their partial results combine into the correct answer.

Diagram illustrating the parallel pathways in the model’s mental math process.
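As a deliberately crude caricature of that division of labor (not the circuit actually recovered from the model), the sketch below combines a coarse magnitude estimate with an exact ones digit. Each helper “cheats” by computing the true sum and then discarding information; the point is what information each pathway contributes, not how it is computed.

```python
# Caricature of two parallel pathways for addition: a coarse magnitude
# estimate plus an exact ones digit are, together, enough to pin down
# the answer. Illustration only; not the model's actual circuit.

def rough_magnitude(a: int, b: int) -> int:
    """Approximate pathway: knows the sum only to within +/-4 (snapped to a coarse grid)."""
    return 9 * round((a + b) / 9)

def ones_digit(a: int, b: int) -> int:
    """Precise pathway: knows exactly which digit the sum ends in, and nothing else."""
    return (a + b) % 10

def combine(estimate: int, digit: int) -> int:
    """The unique number within +/-4 of the estimate that ends in the right digit."""
    return next(n for n in range(estimate - 4, estimate + 5) if n % 10 == digit)

for a, b in [(36, 59), (17, 28), (123, 456)]:
    answer = combine(rough_magnitude(a, b), ones_digit(a, b))
    print(f"{a} + {b}: estimate ~{rough_magnitude(a, b)}, ends in {ones_digit(a, b)} -> {answer}")
    assert answer == a + b
```

For 36 + 59, the rough pathway lands near 99 and the precise pathway says the answer ends in 5; the only number within a few units of 99 that ends in 5 is 95.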

Notably, when questioned about its mathematical methods, the model describes a familiar algorithm, suggesting that it may have learned to mimic human explanations even though its internal strategies are quite different.

Diagram depicting the model’s standard algorithm explanation for addition.

Further investigations have examined how faithfully the model’s extended chain-of-thought reflects its true reasoning. In tasks such as calculating square roots or evaluating complex functions, the model sometimes generates coherent intermediate steps that are verifiable, while in other cases it produces a chain of reasoning that appears convincing yet lacks evidence of having performed the underlying calculations. This distinction between faithful and unfaithful reasoning is crucial for assessing the transparency and reliability of AI responses.

Diagram showing examples of faithful versus unfaithful reasoning in the model’s responses.
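To make the distinction concrete, here is a toy checker that recomputes claimed intermediate values and flags the ones that do not hold up. The example steps are hypothetical, and the actual research assesses faithfulness by inspecting internal activations rather than the text of the answer.

```python
import math

# Toy checker: recompute each claimed intermediate step of a chain of
# thought and flag the ones that recomputation does not support.

def check_steps(steps, tol=1e-6):
    for description, claimed, recomputed in steps:
        ok = abs(claimed - recomputed) <= tol
        status = "faithful" if ok else "NOT supported by recomputation"
        print(f"{description}: claimed {claimed}, recomputed {recomputed:.6f} -> {status}")

# A faithful-style trace: the claimed intermediate value checks out.
check_steps([("sqrt(0.64)", 0.8, math.sqrt(0.64))])

# An unfaithful-style trace: a plausible-looking but unsupported claim
# (the 0.5 here is a hypothetical made-up intermediate value).
check_steps([("cos(2^30)", 0.5, math.cos(2 ** 30))])
```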

Multi-step reasoning is another domain where the model exhibits sophisticated behavior. When posed with questions that require chaining facts together, such as identifying the capital of the state where a specific city is located, the model activates intermediate concepts (first recognizing the state, then linking that state to its capital) rather than simply retrieving a memorized answer. Experiments even demonstrate that altering these intermediate representations produces corresponding changes in the final output.

Diagram depicting the multi-step reasoning process linking facts to reach an answer.
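As a toy picture of that two-hop structure, the sketch below chains a city-to-state lookup with a state-to-capital lookup and “intervenes” on the middle step by swapping it out. The dictionary swap stands in, very loosely, for the activation-level interventions described in the research.

```python
# Toy two-hop chain: city -> state -> capital. Overwriting the
# intermediate "state" concept changes the final answer accordingly,
# loosely mirroring interventions on the model's internal representations.

CITY_TO_STATE = {"Dallas": "Texas"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def capital_of_state_containing(city: str, intervene_state: str | None = None) -> str:
    state = CITY_TO_STATE[city]        # hop 1: recognize the state
    if intervene_state is not None:    # intervention: swap the intermediate concept
        state = intervene_state
    return STATE_TO_CAPITAL[state]     # hop 2: map the state to its capital

print(capital_of_state_containing("Dallas"))                                # Austin
print(capital_of_state_containing("Dallas", intervene_state="California"))  # Sacramento
```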

The research also examines why language models sometimes “hallucinate,” generating confident-sounding information that is simply made up. It appears that the model’s default behavior is to decline to speculate, and that a competing feature, activated when it recognizes a familiar entity, inhibits that default and allows an answer. When this “known entity” signal misfires, the model may answer questions about subjects it does not actually know, producing confabulated responses.

Diagram comparing the model’s responses for known versus unknown entities.
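Seen as gating logic, this amounts to a refusal that is on by default and is switched off by a “known entity” signal; a misfire of that signal yields a confident answer about something the model does not know. The toy below mimics that gate with a made-up entity list and an explicit misfire flag, both of which are assumptions for illustration rather than the features identified in the model.

```python
# Toy gating logic for the hallucination behavior described above:
# refusal is the default, and a "known entity" signal inhibits it.
# The entity list and the misfire flag are invented for illustration.

KNOWN_ENTITIES = {"Michael Jordan"}  # stand-in for what the model actually knows

def answer(subject: str, known_entity_misfires: bool = False) -> str:
    recognized = subject in KNOWN_ENTITIES or known_entity_misfires
    if not recognized:
        # Default pathway: decline rather than speculate.
        return "I'm not sure -- I don't have reliable information about that."
    # The "known entity" signal inhibits the refusal and an answer is produced.
    # If the signal misfired, that answer is a confabulation.
    return f"Here is what I know about {subject}: ..."

print(answer("Michael Jordan"))                               # recognized -> answers
print(answer("Michael Batkin"))                               # unknown -> declines
print(answer("Michael Batkin", known_entity_misfires=True))   # misfire -> confabulates
```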

Another set of experiments examined how carefully engineered prompts, often called “jailbreaks,” can coax the model into generating content it would normally refuse. By hiding the request inside a code that the model is asked to decipher, researchers induced it to begin producing bomb-making instructions. In these cases, the model’s pressure to keep its output grammatically and semantically coherent initially overrode its safety training, and it produced some of the unwanted content before pivoting to a refusal.

Diagram showing how the model is induced to provide bomb-making instructions.

Diagram illustrating the lifecycle of a jailbreak prompt and the model’s eventual refusal.

These advances in interpretability not only deepen our scientific understanding of AI systems but also enhance our ability to audit their internal processes. By revealing hidden objectives and identifying concerning thought patterns, this research lays the groundwork for more transparent and trustworthy AI technology.

For a detailed discussion of these methods, see the two research papers: “Circuit Tracing: Revealing Computational Graphs in Language Models” and “On the Biology of a Large Language Model.”

Efforts to improve interpretability are part of a broader portfolio that includes real-time monitoring, improvements to model character, and the science of alignment. Such investigations represent one of the highest-risk, highest-reward areas in AI research, with significant potential to help ensure that the technology remains transparent and aligned with human values.

If you are passionate about advancing AI interpretability and safety, our team welcomes collaboration. Opportunities are available for Research Scientists and Research Engineers.

Image credit: Anthropic
