ArXiv Preprint · Source Code (GitHub)
Many approaches for identifying causal model components rely on brute-force activation patching, which is slow and computationally inefficient. Moreover, these methods often fail to identify sets of components that work together to produce a desired output.
In this paper, we introduce a technique for automatically identifying causal model components through optimization over an intervention. We establish a set of desiderata, representing the causal attributes of the model components involved in the specific task. Utilizing a synthetically generated dataset aligned with these desiderata, we optimize a binary mask over the model components to pinpoint the causal elements. The resulting optimized mask identifies the model components that encode the desired information, as semantically probed by the desiderata.
A significant limitation of current circuit-identification methods is that, while they help localize components within the neural network, they offer little insight into the semantics of those components: they do not say what the components are doing. Our method addresses this by using desiderata to guide the optimization of the mask, so the discovered components are not only located but also tied to a specific, semantically defined causal role. This is a major advantage over existing methods.
At a high level, the desiderata-based masking technique proceeds in four steps: (1) specify desiderata that capture the causal role the target components should play in the task; (2) synthesize a dataset of input pairs aligned with those desiderata; (3) optimize a sparse binary mask over model components so that patching the masked components produces the behavior the desiderata demand; and (4) read off the components retained by the optimized mask as the discovered circuit. A minimal code sketch of the mask optimization follows.
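The sketch below is illustrative only, not the authors' implementation: the cached activations, the readout module, the loss, and all sizes are toy stand-ins for the real model's components and forward pass, and the exact patching direction and objective in the paper may differ.

# Minimal sketch of desiderata-based mask optimization (illustrative; all
# tensors and the readout module are toy stand-ins for a real model).
import torch
import torch.nn as nn

torch.manual_seed(0)

N_COMPONENTS = 16   # e.g. attention heads / MLPs being masked (toy size)
D_MODEL = 32        # activation width per component (toy size)
N_EXAMPLES = 64

# Cached activations per desideratum example: (examples, components, d_model).
clean_acts = torch.randn(N_EXAMPLES, N_COMPONENTS, D_MODEL)
counterfactual_acts = torch.randn(N_EXAMPLES, N_COMPONENTS, D_MODEL)

# Toy stand-in for "the rest of the model": maps patched activations to logits.
readout = nn.Linear(N_COMPONENTS * D_MODEL, 10)
readout.requires_grad_(False)  # only the mask is trained

# Target the patched model *should* produce under the desideratum
# (e.g. the answer consistent with the counterfactual variable value).
targets = torch.randint(0, 10, (N_EXAMPLES,))

# Learnable mask logits, one per component; sigmoid gives a soft mask in [0, 1].
mask_logits = nn.Parameter(torch.zeros(N_COMPONENTS))
opt = torch.optim.Adam([mask_logits], lr=0.1)
sparsity_weight = 0.05

for step in range(200):
    mask = torch.sigmoid(mask_logits)                      # (components,)
    # Masked components take the counterfactual activation (i.e. are patched);
    # unmasked components keep their clean activation.
    patched = (
        (1 - mask)[None, :, None] * clean_acts
        + mask[None, :, None] * counterfactual_acts
    )
    logits = readout(patched.flatten(start_dim=1))
    task_loss = nn.functional.cross_entropy(logits, targets)
    loss = task_loss + sparsity_weight * mask.sum()        # prefer few components
    opt.zero_grad()
    loss.backward()
    opt.step()

# Components whose mask survives thresholding form the candidate circuit.
selected = (torch.sigmoid(mask_logits) > 0.5).nonzero(as_tuple=True)[0]
print("selected components:", selected.tolist())

The sigmoid relaxation keeps the mask differentiable during training; after optimization it is thresholded (or annealed toward binary) to recover a discrete set of components, and the sparsity penalty encourages attributing the behavior to as few components as possible.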
As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (out of the model's roughly 1,600) and one MLP in the final token's residual stream.
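For the variable-binding setting, a desideratum example pairs a clean arithmetic prompt with a counterfactual prompt that changes the value bound to the queried variable, so a correctly patched circuit should report the counterfactual answer. The snippet below is a hypothetical illustration of such a dataset generator; the prompt template and counterfactual design are assumptions, not the paper's exact format.

# Hypothetical generator of desiderata example pairs for variable binding
# (prompt template and counterfactual choice are illustrative assumptions).
import random

def make_variable_binding_pair(rng: random.Random):
    """Return a (clean_prompt, counterfactual_prompt, patched_target) triple.

    The counterfactual differs only in the value bound to the queried variable,
    so patching a correct variable-binding circuit from the counterfactual run
    into the clean run should make the model report the counterfactual sum.
    """
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    a_cf = rng.choice([v for v in range(1, 10) if v != a])

    template = "x = {a}. y = {b}. x + y ="
    clean_prompt = template.format(a=a, b=b)
    counterfactual_prompt = template.format(a=a_cf, b=b)
    patched_target = str(a_cf + b)   # answer the patched model should give
    return clean_prompt, counterfactual_prompt, patched_target

rng = random.Random(0)
for _ in range(3):
    print(make_variable_binding_pair(rng))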
Our work builds on existing research that aims to localize the model components responsible for performing specific tasks:
Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems (2022).
Notes: Proposed Causal Tracing to locate factual associations, uncovering a specific sequence of steps within middle-layer feed-forward modules that mediate factual predictions during the processing of subject tokens.
Zhengxuan Wu, Atticus Geiger, Christopher Potts, Noah D. Goodman. Interpretability at Scale: Identifying Causal Mechanisms in Alpaca. arXiv preprint (2023).
Notes: Proposed Boundless Distributed Alignment Search (Boundless DAS) to uncover alignments between interpretable symbolic algorithms and the neural network's internal representations.
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso. Towards Automated Circuit Discovery for Mechanistic Interpretability. arXiv preprint (2023).
Notes: Proposed Automatic Circuit DisCovery (ACDC) to automatically discover circuits based on a chosen metric and dataset.
The preprint can be cited as follows.
Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau. Discovering Variable Binding Circuitry with Desiderata. arXiv preprint arXiv:2307.03637 (2023).
@article{davies2023discovering,
  title={Discovering Variable Binding Circuitry with Desiderata},
  author={Davies, Xander and Nadeau, Max and Prakash, Nikhil and Rott Shaham, Tamar and Bau, David},
  journal={arXiv preprint arXiv:2307.03637},
  year={2023}
}