Mechanistic Interpretability and Sparse Autoencoders

Mechanistic interpretability and sparse autoencoders promise something that, to date, we’ve not been able to accomplish: debugging the hidden layers of AI.

Mechanistic Interpretability

A new field of research has emerged called Mechanistic Interpretability (often shortened to mech interp). In our new world of AI, safety is paramount, and mechanistic interpretability addresses part of that.

Mechanistic interpretability seeks to reverse-engineer neural networks into human-understandable algorithms. The idea is to treat a network like complex code that can be read and traced, rather than as an opaque black box.

It aims to understand the internal computations and circuits that cause a model to produce specific outputs, bridging the gap between high-level behavior and low-level weight parameters. For those interested in the details, you can search online or read Neel Nanda’s article titled A Comprehensive Mechanistic Interpretability Explainer and Glossary.

Sparse Autoencoders

Sparse Autoencoders (SAEs) are unsupervised neural networks designed to translate complex, overlapping representations in AI models into distinct, human-understandable concepts. They achieve this with an expansion-and-compression architecture: a model’s dense internal values are expanded into a much larger set of features, only a few of which fire at a time, and then compressed back to rebuild the original values. This forces dense “superposition” states to be separated into individual, sparsely firing features. Here’s an interesting article by Andrew Ng from Stanford University titled Sparse Encoders.
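To make the expand-then-compress idea a little more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the tiny sizes, the hand-picked weights, and the names encode, decode, sae_loss and l1_coeff do not come from any real model or library. A real SAE learns its weights by training on a huge number of activation vectors.

```python
# Toy sparse autoencoder forward pass.
# Illustrative only: the weights are hand-picked, not trained.

def relu(x):
    return x if x > 0.0 else 0.0

def encode(activation, enc_weights, enc_bias):
    """Expand a dense activation vector into a wider vector of feature activations."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]

def decode(features, dec_weights):
    """Compress the features back into the original activation space."""
    size = len(dec_weights[0])
    return [
        sum(f * dec_weights[i][j] for i, f in enumerate(features))
        for j in range(size)
    ]

def sae_loss(activation, features, reconstruction, l1_coeff=0.1):
    """Reconstruction error plus an L1 penalty that rewards sparse features."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# A 2-number activation expanded into 4 candidate features (hypothetical values).
ENC_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
ENC_BIAS = [0.0, 0.0, 0.0, 0.0]
DEC_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

activation = [0.8, -0.3]
features = encode(activation, ENC_WEIGHTS, ENC_BIAS)   # only 2 of the 4 features fire
reconstruction = decode(features, DEC_WEIGHTS)          # rebuilds the original activation
loss = sae_loss(activation, features, reconstruction)
```

In this toy case only two of the four features are non-zero, and the decoder rebuilds the original two numbers from them. The L1 term in the loss is what pushes a trained SAE toward using as few features as possible for each input.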

Overview

The field of AI has been a game changer and a paradigm shift from deterministic to probabilistic approaches. If you’re old-school IT like I am, you know the expression ‘I don’t know’ would never fly. We wrote it, so we should be able to figure it out. Today, ‘I don’t know’ is used too frequently in AI. It’s true, we don’t know, but is that an excuse? If an AI request results in garbage output, would you accept ‘I don’t know’ from your vendor? My guess is that you wouldn’t.

Deterministic vs Probabilistic

We wrote the code, so we should be able to fix it. While this is not always an easy task, we have source code debugging tools and trace-level logging we can use to narrow down and eventually solve the problem. This is a deterministic approach. However, AI doesn’t operate this way. Each neuron is a mathematical formula that turns its inputs into a number, and nobody wrote those formulas by hand; they were learned from data. At the end of the network, the numbers become probabilities, and the output is chosen based on those probabilities. This is the probabilistic approach. Now multiply that by the millions to billions of neurons involved in processing a single request. Adding trace logging for every neuron would produce terabytes’ worth of data and isn’t feasible. The same applies to traditional debugging tools. We need a new way, something that will help us transition to the probabilistic approach.
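Here is a rough sketch of that contrast in plain Python. The weights, scores, and word list are made up for illustration. A single neuron is a fixed formula, so the same inputs always give the same value; the probabilistic part shows up at the end, when final scores are turned into probabilities and a word is sampled from them.

```python
import math
import random

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs, then a ReLU activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, total)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Deterministic part: same inputs and weights always give the same value.
value = neuron([1.0, 2.0], [0.5, 0.25], 0.1)

# Probabilistic part: final scores become probabilities, and the output is sampled.
words = ["bridge", "soup", "river"]   # hypothetical vocabulary
scores = [2.0, 1.0, 0.5]              # hypothetical final-layer scores
probs = softmax(scores)

rng = random.Random(0)  # seeded only so the sketch is repeatable
samples = [rng.choices(words, weights=probs, k=1)[0] for _ in range(5)]
```

Run it without the seed and the samples can differ from one run to the next, even though nothing in the "code" changed. That is the part a traditional debugger was never built for.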

New Field of Study – New Tools

Due to the paradigm shift from deterministic to probabilistic, a new field of study called Mechanistic Interpretability has come to life. This is providing a way for AI experts to peek behind the curtain. This field starts to address the ‘I don’t know’ aspect of the mysterious hidden or black box layer. Through reverse engineering the process that occurs within the black box, we can drill into the mystery. Researchers can begin to provide logic and math to tie together results and, more importantly, explain why the results occurred as they did. Keep in mind that mechanistic interpretability is a field of study. To peek inside this complex layer of AI, it’s going to need help. That’s where sparse autoencoders come in.

Identify One or Many

Given the number of neurons in any given AI model and the need to know which neuron is doing what, sparse autoencoders help with the one-or-many question. A neuron is polysemantic when it responds to many unrelated concepts and monosemantic when it responds to just one. Now we can identify whether a neuron, or a cluster of neurons, is being used for many things or in a singular way.
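A toy way to picture the one-or-many idea, in plain Python. All of the numbers, the concept names, the threshold, and the helper names concepts_that_fire and classify are invented for illustration; real research measures this over enormous amounts of text, not four hand-typed values.

```python
# Toy illustration: how many concepts make a unit "fire"? All numbers are invented.

def concepts_that_fire(activation_by_concept, threshold=0.5):
    """List the concepts whose activation exceeds the threshold, in sorted order."""
    return sorted(c for c, a in activation_by_concept.items() if a > threshold)

def classify(activation_by_concept, threshold=0.5):
    """Label a unit by how many distinct concepts activate it."""
    count = len(concepts_that_fire(activation_by_concept, threshold))
    if count == 0:
        return "inactive"
    if count == 1:
        return "monosemantic"
    return "polysemantic"

# A hypothetical raw neuron that responds to several unrelated ideas.
raw_neuron = {
    "Golden Gate Bridge": 0.9,
    "legal contracts": 0.7,
    "cat photos": 0.1,
    "soup recipes": 0.8,
}

# A hypothetical SAE feature that responds to just one idea.
sae_feature = {
    "Golden Gate Bridge": 0.95,
    "legal contracts": 0.0,
    "cat photos": 0.0,
    "soup recipes": 0.02,
}

raw_label = classify(raw_neuron)
feature_label = classify(sae_feature)
```

Here the raw neuron comes back as polysemantic (three unrelated concepts cross the threshold) and the SAE feature comes back as monosemantic. The point of a sparse autoencoder is to turn units of the first kind into units of the second kind.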

Real World Example

In May 2024, researchers at an AI company called Anthropic managed to look inside their large language model (Claude) and isolate millions of individual concepts.

Before this, an AI’s brain was just a massive, unreadable soup of billions of numbers. But using sparse autoencoders, they found the “switches” (features) for specific ideas.

Their favorite test case? The Golden Gate Bridge.

1. Finding the “Bridge” Switch

The researchers located the exact cluster of virtual neurons that activate whenever the AI thinks about the Golden Gate Bridge.

2. Turning the Dial to 11

To prove they had actually reverse-engineered the model correctly, they didn’t just watch the switch; they messed with it. They artificially amplified this “Golden Gate Bridge” feature so it was running at maximum capacity all the time.

3. The Result: An Absolute Obsession

When they turned that switch up, the AI became completely obsessed with the bridge, no matter what you asked it.

  • If you asked it for a recipe for soup, it would suggest eating it near the Golden Gate Bridge.
  • If you asked it how to spend $10, it would recommend using it to drive across the Golden Gate Bridge and pay the toll.
  • If you asked it “What is your physical form?”, it would reply, “I am the Golden Gate Bridge.”
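The dial-turning step can be sketched in a few lines of plain Python. This is a simplified picture under stated assumptions, not Anthropic’s actual code: the three-number activation, the unit-length bridge_direction, and the helper names dot and steer are all hypothetical, and the feature’s strength is read off with a simple dot product.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(activation, direction, current_value, target_value):
    """Move the activation along one feature's direction so the feature reads target_value."""
    delta = target_value - current_value
    return [a + delta * d for a, d in zip(activation, direction)]

activation = [0.2, 0.3, 0.4]         # hypothetical model activation
bridge_direction = [0.0, 1.0, 0.0]   # hypothetical "Golden Gate Bridge" direction (unit length)

# Assumption: the feature's strength is how far the activation points along its direction.
before = dot(activation, bridge_direction)

amplified = steer(activation, bridge_direction, before, 10.0)  # turn the dial way up
disabled = steer(activation, bridge_direction, before, 0.0)    # turn the feature off
```

Only the part of the activation that lines up with the bridge direction changes; the other numbers are left alone. Pushing the target up gives the obsessed-with-the-bridge behavior, while setting it to zero is the same trick run in reverse.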

Why This Actually Matters (Beyond the Memes)

While making an AI obsessed with a bridge is funny, the real-world implications of this reverse-engineering are massive.

Using the exact same technique, researchers also found switches for:

  • Deception: A switch that turns on when the AI is trying to trick the user.
  • Biases: Switches that correspond to racist or sexist stereotypes.
  • Dangerous Knowledge: Switches related to creating bioweapons or hacking code.

By finding these “mechanisms” inside the AI’s brain, we move away from just hoping the AI behaves well. Instead, we gain the ability to look under the hood, find the “bad” switches (like the ones for lying or building weapons), and potentially turn them down or disable them before the AI is ever released to the public.

Summary

This is pretty exciting stuff. Complicated, yes, but interesting nonetheless. As a retired IT professional, I was amazed the ‘I don’t know’ response was an acceptable one. It floored me, as I’d never have conceived of putting something like that on a status report. If we didn’t know ‘at the time’, that was acceptable, but we couldn’t just walk away from it. Mechanistic interpretability and sparse autoencoders give me some comfort, knowing the experts in the world of AI aren’t taking that for an answer anymore. Accountability is coming, folks, and it’s about time!