SAEs trained on the same data don’t learn the same features
In this post, we show that when two TopK SAEs are trained on the same data, with the same batch order but with different random initializations, many latents in the first SAE have no close counterpart in the second, and vice versa. In fact, only about 53% of the features are shared between the two runs. Furthermore, many of these unshared latents are interpretable. We also find that narrower SAEs have higher feature overlap across random seeds, and that the overlap decreases as SAE size increases.
Why it matters
This finding has implications for the interpretability and robustness of deep learning models: even models trained on identical data, in identical order, may learn different internal representations, so no single trained SAE yields a canonical set of features.
Key Points
- SAEs trained on the same data with different random initializations learn many distinct latent features
- Only about 53% of the learned features are shared between the two SAE models
- Narrower SAE architectures have higher feature overlap across random seeds
- Larger SAE models exhibit less shared learning of features
Details
The blog post explores the phenomenon that when two sparse autoencoder (SAE) models are trained on the same data, with the same batch order but different random initializations, they end up learning many latent features that do not have close counterparts in the other model. The authors find that only around 53% of the learned features are shared between the two SAE models. Furthermore, many of the unshared latents are interpretable, suggesting that the models are learning genuinely distinct decompositions of the input data rather than noise. The authors also observe that narrower SAE architectures have a higher degree of feature overlap across random seeds, while larger SAE models exhibit less shared learning of features. This suggests that model size can affect the diversity of learned representations, even when the training data is held fixed.
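One way to quantify this kind of cross-seed overlap is to match each latent in one SAE to its most similar latent in the other, using the cosine similarity of their decoder directions, and count the fraction whose best match exceeds a threshold. The sketch below illustrates this approach with synthetic decoder matrices; the `feature_overlap` function, the 0.7 threshold, and the matrix shapes are illustrative assumptions, not the post's exact methodology.

```python
import numpy as np

def feature_overlap(dec_a, dec_b, threshold=0.7):
    """Fraction of SAE A's latents with a close counterpart in SAE B.

    dec_a, dec_b: decoder weight matrices of shape (n_latents, d_model),
    one decoder direction per row. The 0.7 cosine-similarity threshold
    is an illustrative choice, not a value taken from the post.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T                 # pairwise cosine similarities
    best_match = sims.max(axis=1)  # best counterpart in B for each latent of A
    return (best_match >= threshold).mean()

# Synthetic demo: SAE B shares half its latents with SAE A, so the
# measured overlap should come out near 0.5.
rng = np.random.default_rng(0)
sae_a = rng.normal(size=(512, 64))
sae_b = np.vstack([sae_a[:256], rng.normal(size=(256, 64))])
print(feature_overlap(sae_a, sae_b))
```

Note that this metric is asymmetric (A matched into B versus B matched into A), which is why the post checks for missing counterparts in both directions.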