OpenMOSS Interpretability Research
We aim to break neural networks down into basic features and understand what those features represent, how they are connected, and how these structures form. Ultimately, we pursue the beauty in their internal structure.
Bridging the Attention Gap: Complete Replacement Models for Complete Circuit Tracing
With both attention and MLP replacement layers, we fully sparsified a language model, exposing its underlying local and global circuits.
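The core idea behind a replacement layer can be sketched in a few lines. The toy code below is illustrative only (it is not the OpenMOSS implementation, and all sizes, names, and the top-k sparsification rule are assumptions): a dense hidden layer is replaced by a much wider dictionary of features that activate sparsely, so each active feature can be traced individually when reconstructing circuits.

```python
# Illustrative sketch of a sparse replacement layer (hypothetical, not the
# OpenMOSS code): encode into a wide feature dictionary, keep only the
# top-k activations, then decode back to the residual stream width.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict, k = 8, 64, 4   # assumed toy sizes; k = features kept active

W_enc = rng.normal(size=(d_model, d_dict)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_dict, d_model)) / np.sqrt(d_dict)

def replacement_layer(x):
    """Map x to sparse feature activations, then reconstruct the output."""
    pre = np.maximum(x @ W_enc, 0.0)      # ReLU feature activations
    drop = np.argsort(pre)[:-k]           # indices of all but the k largest
    feats = pre.copy()
    feats[drop] = 0.0                     # zero everything outside the top-k
    return feats @ W_dec, feats

x = rng.normal(size=d_model)
out, feats = replacement_layer(x)
```

Because at most `k` of the 64 dictionary features are nonzero for any input, the layer's output decomposes into a small sum of feature contributions (`feats[i] * W_dec[i]`), which is what makes attributing the output to individual features tractable.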