The Scaling Curve: Dario Amodei, Anthropic, and the Race to Build and Survive Superintelligence — Chapter Eight

1. Mechanistic interpretability aims to understand the *how* and *why* of AI model behavior, addressing the "black box" problem inherent in systems that are "grown" rather than explicitly programmed.
2. Chris Olah pioneered the idea of studying neural networks as "biological artifacts." The vision was initially marginal, but Dario Amodei championed it as crucial both for AI safety and for deep scientific insight, with possible implications for neuroscience.
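3. Mechanistic interpretability works to decipher the algorithms a model implements internally. A key obstacle was "polysemanticity," in which a single neuron responds to several unrelated concepts. The concept of "superposition," the idea that a model packs more features than it has neurons into overlapping directions, explained this obstacle and opened the way to identifying and mapping millions of human-understandable features and circuits within models.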
4. A mid-2024 demonstration showed that internal concepts could be manipulated directly: amplifying a "Golden Gate Bridge" feature in Claude made the model fixate on the bridge, suggesting that features related to deception or power-seeking might be controlled in the same way.
5. Interpretability helps explain why models sometimes acquire specific capabilities suddenly, in ways scaling laws often fail to predict: the underlying internal circuits strengthen and consolidate gradually until they "snap into place."
6. Anthropic sustained its interpretability team for years without commercial return and published its research openly to advance AI safety across the industry; by 2025, the work had begun to yield practical commercial applications.
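The superposition idea in point 3 can be made concrete with a toy example. The sketch below is hypothetical and is not the method Anthropic used: it assumes three made-up features ("cat", "car", "bridge") stored as unit directions 120 degrees apart in a two-dimensional activation space, so there are more features than dimensions, plus a single made-up neuron that reads along the 60-degree direction. That neuron fires equally for "cat" and "car" (polysemanticity), yet because only one feature is active at a time, projecting onto the feature directions rather than onto individual neurons still recovers which concept is present.

```python
import math

# Hypothetical toy: three features squeezed into a 2-D activation space,
# each a unit vector 120 degrees apart (more features than dimensions).
FEATURES = {
    name: (math.cos(math.radians(angle)), math.sin(math.radians(angle)))
    for name, angle in [("cat", 0), ("car", 120), ("bridge", 240)]
}

# A single hypothetical "neuron" that reads along the 60-degree direction.
NEURON = (math.cos(math.radians(60)), math.sin(math.radians(60)))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relu(x):
    """Rectified linear activation: negative inputs become 0.0."""
    return max(0.0, x)


def neuron_response(feature_name):
    """Activation of the neuron when only `feature_name` is present."""
    return relu(dot(FEATURES[feature_name], NEURON))


def decode(activation):
    """Guess the active feature by projecting onto every feature direction.

    This only works because the toy features are sparse: at most one is
    active at a time, so the largest projection identifies it.
    """
    return max(FEATURES, key=lambda name: dot(activation, FEATURES[name]))


if __name__ == "__main__":
    for name in FEATURES:
        # "cat" and "car" both drive the same neuron: it is polysemantic.
        print(name, round(neuron_response(name), 3))
    print("decoded:", decode(FEATURES["car"]))
```

The toy shows why reading individual neurons can mislead while reading the right directions does not; real models have far more dimensions and features, and finding those directions is the hard part this sketch simply assumes away.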
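The "amplifying" step in point 4 can also be sketched, again under loudly stated assumptions. If a feature corresponds to a single direction in a layer's activation vector, one way to amplify it is to rewrite the activation so that its component along that direction takes a chosen large value while every orthogonal component stays untouched. The helper `steer`, the activation vector, and the feature direction below are all hypothetical; this is not Anthropic's code or the actual Golden Gate Bridge feature.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def steer(hidden, direction, target):
    """Clamp the component of `hidden` along `direction` to `target`.

    Hypothetical helper: assumes a feature is a single direction in
    activation space. Components orthogonal to `direction` are unchanged.
    """
    norm = math.sqrt(dot(direction, direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [x / norm for x in direction]
    current = dot(hidden, unit)  # how strongly the feature is active now
    # Shift only along the feature direction, by exactly the needed amount.
    return [h + (target - current) * u for h, u in zip(hidden, unit)]


if __name__ == "__main__":
    hidden = [0.2, -0.1, 0.4]          # made-up activation vector
    bridge_feature = [0.0, 0.0, 1.0]   # made-up feature direction
    print(steer(hidden, bridge_feature, target=10.0))
```

Clamping the component, rather than adding a fixed multiple of the direction, holds the feature at the same strength whatever its original activation was; both variants are plausible readings of "amplifying," and the sketch does not claim to match the one used in the demonstration.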