Mech Interp: Paradigms

Jul 10, 2025


With so much to look at, it is difficult to decide where to start, but it is probably best to just start somewhere. I think I will begin with Toy Models of Superposition (Elhage et al. 2022). I remember really liking the findings and approach of the paper, since trying to understand simple toy models seems like a very intuitive approach to me. I have found this useful even in my own research, where instead of directly trying to understand the results of large simulations with lots of moving pieces, we try to understand much simpler toy simulations to build a scaffold for understanding the complex behavior we observe. Since the work deals with toy models, I should also be able to play around and run some of these experiments fairly easily. This should also be a natural entry point to sparse autoencoders (SAEs), since they seem to be the current main approach towards untangling this problem of superposition.

Before that, I would like to read some resource that gives a higher-level view of the history of the field and where it is going, to get a better bird's-eye view of the landscape. At the same time, I am keeping a list of concepts I want to learn more about as I come across them, and hopefully this list does not get too long.

I decided to begin by reading the Alignment Forum post by Lee Sharkey titled “Mech interp is not pre-paradigmatic”. Since I am looking for some initial paradigm in which to frame mech interp research, the title suggested the post would present some form of such a framework. As the title says, it argues that mech interp is not in what might be called a pre-paradigmatic phase, and is instead at the mature stage of a second mini-paradigmatic wave. Pre-paradigmatic here refers to a stage in the development of a scientific field before it has established a dominant theoretical framework or "paradigm," a concept from philosopher Thomas Kuhn's influential work "The Structure of Scientific Revolutions”, where he outlines how a field progresses through these stages.

While some argue that mech interp, being a relatively new field, is in this pre-paradigmatic phase, the post pushes the idea that mech interp actually began as an offshoot of computational neuroscience and hence inherits many of the ideas and concepts established there. In fact, the argument is made that it has been rediscovering many of the same ideas explored in neuroscience! I certainly lack sufficient knowledge in either area to judge this statement, but I don’t find the claim surprising given the parallels. Further, given the greater tractability of working with neural networks, perhaps progress might eventually flow in the other direction! I think the worry is that neuroscience seems to be struggling to make recent progress, and that mech interp might hit a similar wall. The article suggests the following Three Waves of Mech Interp:

  1. First Wave (2010s): Focused on demonstrating that interpretable structure exists in deep neural networks, with the core claims that 'features are the fundamental unit of neural networks' and 'features are connected by weights, forming circuits'. This wave ended when researchers discovered polysemantic neurons (neurons that respond to multiple unrelated concepts).
  2. Second Wave (2022-present): Emerged after the "Toy Models of Superposition" paper, introducing sparse dictionary learning (SDL) to address polysemanticity (a minimal sketch of the SDL idea follows this list). However, this wave now faces its own anomalies.
  3. Potential Third Wave? The post suggests "Parameter Decomposition" as a promising approach that could resolve Second-Wave anomalies by decomposing neural network parameters into interpretable components representing computational mechanisms. Worth noting that this is what the author is working on at present.
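
Since sparse dictionary learning keeps coming up, here is my rough mental model of what a sparse autoencoder does, written as a minimal PyTorch sketch. The class name, dimensions, and the L1 coefficient are my own illustrative choices, not taken from the post or from any specific paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose activations into a wider, sparsely-firing set of features
    (an illustrative sketch, not the exact architecture from any paper)."""

    def __init__(self, d_activation: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_activation, d_dict)   # d_dict >> d_activation
        self.decoder = nn.Linear(d_dict, d_activation, bias=False)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # feature activations, pushed towards sparsity
        x_hat = self.decoder(f)          # reconstruction of the original activations
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = ((x - x_hat) ** 2).mean()      # reconstruct the activations faithfully
    sparsity = f.abs().sum(dim=-1).mean()  # L1 penalty: few features active per input
    return recon + l1_coeff * sparsity
```

The intuition, as far as I understand it so far, is that by making the dictionary much wider than the activation space and penalizing how many features fire at once, each learned feature gets a chance to be monosemantic even if individual neurons are not.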

At the same time, I think the transition from the first to the second wave lines up with the rise of LLMs, since much of the early work was done with CNNs. Much larger LLMs clearly showed much richer representations and required understanding polysemanticity and superposition. As for parameter decomposition, I will have to come back to it, since I don’t yet understand enough to appreciate its arguments. This outline also makes me feel like the Toy Models paper is indeed the best place to start, if it marks such a big transition in the field’s thinking. I think it would be good for me to try and define mech interp, superposition, and polysemanticity, perhaps after going through the paper, and, for example, to think about this particular quote from the post:

The idea that 'networks represent more features than they have neurons'. It is a natural corollary of the superposition hypothesis that neurons would exhibit polysemanticity, since there cannot be a one-to-one relationship between neurons and 'features'.
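
To make that statement concrete for myself, here is a tiny numerical sketch in the spirit of the Toy Models setup. The dimensions, sparsity level, and training loop are my own illustrative assumptions, not the paper's exact experiment: five sparse features are forced through a two-dimensional bottleneck, so the two hidden 'neurons' cannot each correspond to a single feature.

```python
import torch

n_features, n_hidden = 5, 2   # more features than hidden dimensions
W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Synthetic sparse data: each feature is active only ~10% of the time.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) < 0.1)
    x_hat = torch.relu(x @ W.T @ W + b)  # squeeze into 2 dims, reconstruct 5 features
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Off-diagonal entries in the Gram matrix show features sharing hidden directions,
# i.e. stored in superposition, which forces the hidden units to be polysemantic.
print(W.T @ W)
```

If this behaves the way I expect, the learned feature directions (the columns of W) end up non-orthogonal, which is exactly the 'more features than neurons' picture the quote describes.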

The article also presents a list of anomalies in this Second Wave that I would like to return to once I have built more understanding, since these should be the open problems the field is looking at now. While this post is just one perspective, I think it gives a good enough mental framework to start with, along with lots of references and ideas to think about. Alright, onwards to Toy Models!
