Theory of Change
Goodfire's theory of change has three layers, which have evolved over the company's short existence:
Layer 1 -- Understanding (original, 2024): Build tools to "open the black box" of neural networks. Use mechanistic interpretability methods such as sparse autoencoders (SAEs) and parameter decomposition to extract interpretable features and circuits, then sell these tools to enterprises and researchers.
Layer 2 -- Intentional Design (pivoted to, Feb 2026): Use interpretability not just to inspect trained models but to shape training itself. Tom McGrath's manifesto: "The aim of intentional design is to use interpretability tools to shape training by sculpting learning from each individual datapoint." The key principle is "don't fight backprop" -- rather than pushing against gradient descent, reshape the loss landscape so the model naturally learns what you want. Their analogy: moving from selective breeding to genetic engineering.
Layer 3 -- Scientific Abundance: Extract knowledge from "narrowly superhuman" AI models to advance human science. Apply interpretability to biological foundation models, materials science models, etc. as a "microscope for understanding what the models have learned."
Eric Ho (CEO): "Interpretability, for us, is the toolset for a new domain of science: a way to form hypotheses, run experiments, and ultimately design intelligence rather than stumbling into it."
What They Do
Key technical result: RLFR (Reinforcement Learning from Feature Rewards) -- used lightweight probes on a frozen copy of a model to provide reward signals for RL training, reducing hallucinations in Gemma-3-12B-IT by 58% at ~90x lower cost than LLM-as-judge. Critically, the probe remained useful for monitoring after training, suggesting the process didn't undermine interpretability. This is their first proof-of-concept for intentional design.
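The reward mechanism described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Goodfire's implementation: it assumes a logistic linear probe, and the names `probe_score` and `feature_reward`, the probe weights, and the activation vectors are all hypothetical. The activations are assumed to be read from a frozen copy of the model, so the policy being trained has no gradient path through the probe.

```python
import math

def probe_score(activation, weights, bias):
    """Logistic linear probe: estimated probability that a span is hallucinated.

    `activation` is assumed to come from a FROZEN copy of the model,
    and `weights`/`bias` are fixed during RL (hypothetical values here).
    """
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def feature_reward(span_activations, weights, bias):
    """Scalar reward for one response: 1 minus the mean probe probability.

    A response whose spans look less hallucinated to the probe earns a
    higher reward. An empty response gets the neutral maximum of 1.0.
    """
    if not span_activations:
        return 1.0
    probs = [probe_score(a, weights, bias) for a in span_activations]
    return 1.0 - sum(probs) / len(probs)

# Hypothetical two-dimensional probe and two candidate responses.
PROBE_WEIGHTS = [1.0, -1.0]
PROBE_BIAS = 0.0
grounded_response = [[-2.0, 1.0], [-1.0, 0.5]]    # probe reads "likely grounded"
hallucinated_response = [[2.0, -1.0], [1.5, 0.0]]  # probe reads "likely hallucinated"

reward_grounded = feature_reward(grounded_response, PROBE_WEIGHTS, PROBE_BIAS)
reward_hallucinated = feature_reward(hallucinated_response, PROBE_WEIGHTS, PROBE_BIAS)
```

In an RL loop, a scalar like `reward_grounded` would stand in for the score an LLM judge would otherwise produce. A small probe evaluation per span is far cheaper than a judge-model call, which is the kind of saving the cost figure above points at.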
Product: Ember platform -- the first hosted mechanistic interpretability API (launched Dec 2024). Supports Llama 3.3 70B and 3.1 8B. Capabilities include AutoSteer (automatic feature steering from natural language), feature search, contrastive search. Public API was deprecated in Feb 2026; now operates as a partner-deployed platform.
Customers: Rakuten (PII detection, 44M+ monthly queries), Microsoft, Mayo Clinic (genomic medicine), Arc Institute (Evo 2 genomics), Prima Mente (Alzheimer's biomarkers), Radical AI (materials science).
Scientific discovery: Applied interpretability to Prima Mente's epigenetic model and discovered cell-free DNA fragment length as a novel Alzheimer's biomarker -- previously unstudied in Alzheimer's literature. Dan Balsam: "one of the first examples of learning something new from a model by studying it."
Open-source contributions: Released SAEs for Llama 70B and 8B. Published SDK on GitHub. Ran hackathon with Apart Research (200+ researchers, 15 countries). Fellowship program for interpretability researchers.
Research pipeline: Stochastic Parameter Decomposition (SPD) led by Lee Sharkey as an alternative to activation-space methods. Circuit tracing replications. Infrastructure scaling to trillion-parameter models (Kimi K2).
Key People
Eric Ho (CEO, co-founder): Yale '16. Previously co-founded RippleMatch (AI recruiting, ~$80M raised, Forbes 30 Under 30). Left in 2023, citing AI risk concerns. No prior interpretability background. Startup operator who found a science-driven co-founder. Predicts neural nets "fully decoded by 2028." Self-described "AGI pilled."
Tom McGrath (Chief Scientist, co-founder): Founded DeepMind interpretability team. Senior Research Scientist at DeepMind 2019-2024. Author of the intentional design manifesto. More measured than Ho -- explicitly warns intentional design "probably shouldn't be used on frontier models today" and calls paranoia "a way of life."
Lee Sharkey (Principal Investigator): Co-founded Apollo Research (safety evals), previously at Conjecture. Pioneer of SAEs for language models. Now leading SPD/APD research. His move from a pure safety org (Apollo) to a commercial interpretability startup is a data point about where top researchers see leverage.
Team: ~51 employees as of Jan 2026. ML Engineer salary $200K-$400K. 5-day in-office (SF + NYC).
Money and Incentives
Funding: $207M total, entirely VC-funded.
- Seed: $7M (Aug 2024), Lightspeed lead
- Series A: $50M at $200M valuation (Apr 2025), Menlo lead; Anthropic $1M (first-ever startup investment)
- Series B: $150M at $1.25B valuation (Feb 2026), B Capital lead; DFJ Growth, Salesforce Ventures, Eric Schmidt
Revenue: Not disclosed. Usage-based API pricing ($0.35-$1.90/million tokens) now shifted to partner-deployed platform. "Tokens nearly tripling monthly" (Contrary, Aug 2025).
Estimated burn: With ~51 employees at $180K-$400K, salaries alone come to roughly $9M-$20M per year; adding compute and overhead, annual burn is likely $20M-$40M+. Runway is comfortable at current burn, but VC investors will expect aggressive growth.
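The arithmetic behind this estimate can be checked directly. The sketch below uses only figures stated in this document (headcount, salary range, total raised, estimated burn range). The runway figures are an upper bound under a simplifying assumption: they treat all $207M as still in the bank and ignore revenue, neither of which is publicly known.

```python
# Figures taken from this document; variable names are illustrative.
HEADCOUNT = 51
SALARY_LOW, SALARY_HIGH = 180_000, 400_000       # per-employee range, USD
TOTAL_RAISED = 207_000_000                       # seed + Series A + Series B
BURN_LOW, BURN_HIGH = 20_000_000, 40_000_000     # estimated annual burn, USD

# Payroll alone, before compute, benefits, and office costs.
payroll_low = HEADCOUNT * SALARY_LOW
payroll_high = HEADCOUNT * SALARY_HIGH

# Upper-bound runway: assumes no capital has been spent yet and no revenue.
runway_years_best = TOTAL_RAISED / BURN_LOW
runway_years_worst = TOTAL_RAISED / BURN_HIGH

print(f"Payroll: ${payroll_low / 1e6:.1f}M to ${payroll_high / 1e6:.1f}M per year")
print(f"Runway:  {runway_years_worst:.1f} to {runway_years_best:.1f} years (upper bound)")
```

Payroll alone sits below the $20M-$40M range, so the gap is attributed here to compute and overhead, consistent with the compute-bound cost structure noted under the business model. Even at the high end of burn, the upper-bound runway is about five years, which is why the estimate calls it comfortable.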
Business model: PBC (Delaware). Usage-based API shifted to selective enterprise partnerships. Compute-bound cost structure.
Incentive tensions:
- VC investors at $1.25B expect 10x+ returns. Scientific Abundance (pharma/biotech applications) has clearest revenue path. Pure safety research (Understanding) does not generate revenue directly.
- Anthropic's $1M investment creates a structural tension: Anthropic endorses Goodfire publicly, runs the world's largest in-house interpretability team, and benefits as an investor regardless of outcome. If Anthropic's team supersedes Goodfire's tools, the company's safety rationale weakens.
- Investor base is commercial (B Capital, Salesforce Ventures, Eric Schmidt, Lightspeed), not impact-oriented. They will prioritize revenue growth.
- PBC structure is the minimum viable safety governance -- it allows prioritizing benefit but does not require it.
- No philanthropic funding means zero external accountability to safety community expectations.
What Others Say
"Most Forbidden Technique" criticism: Zvi Mowshowitz warned against training on interpretability because it trains models to evade interpretability monitors. Satya Benson applied this directly to Goodfire (LW, 32 karma): "This seems like an instance of 'The Most Forbidden Technique.'" The strongest commenter (JBlack): "the gradient-decomposition approach only works if interpretability is complete and accurate -- which it isn't." Goodfire's response: the frozen-copy probe approach structurally avoids the problem because the model can't backpropagate through the probe.
Independent steering evaluation: An external researcher benchmarked AutoSteer against prompting on Llama 8B/70B. Finding: "prompting remains the cheapest, most reliable control knob." Standalone steering drops coherence by ~0.6 points. (The author notes that the SDK has been updated since.)
Defense from researchers: An LW post defends the research direction: "It Is Reasonable To Research How To Use Model Internals In Training." Notes this is "a pretty normal area of interpretability research" with work from Anthropic Fellows, FAR, and others. Steven Byrnes provides a contingent defense based on brain-like reward systems where "the loop doesn't close."
Comprehensive interpretability critique: Charbel-Raphael's 12K-word takedown of nearly all interpretability theories of impact (2023). Argues interpretability is unlikely to help with deception, enumerative safety is doomed, and the field may be net harmful through dual use and false sense of control.
OpenAI's obfuscated reward hacking: Directly relevant precedent. Penalizing "bad thoughts" in chain-of-thought led to agents that still reward-hack but hide intent. McGrath explicitly addresses this, acknowledging the concern while arguing the frozen-copy approach is structurally different.
What's Absent
Board composition: Completely opaque. No public list of board members, investor board seats, or safety-focused governance. For a $1.25B PBC whose mission centers on safety, this is a striking omission.
Revenue and unit economics: No public revenue figures for a unicorn-valued company.
Safety policy for intentional design: Only a 362-word blog post on safety measures (feature moderation, I/O filtering). No policy on when intentional design techniques are safe for frontier models, who decides, or what safeguards exist.
Independent validation: The RLFR hallucination result (58% reduction) has no external replication. The only independent steering evaluation found that prompting outperforms steering.
Comparison with Anthropic's interp team: No direct comparison exists between Goodfire's tools and Anthropic's in-house work (circuit tracing, attribution graphs, constitutional classifiers).
Publication policy: No stated policy for what research stays proprietary vs. gets published.
Recommended Reading
Dan Balsam & Tom McGrath on Cognitive Revolution (Feb 2026) -- Most candid source. Deep on intentional design, honest about limitations, responds to reward hacking concerns. "Paranoia is a way of life." https://www.cognitiverevolution.ai/don-t-fight-backprop-goodfire-s-vision-for-intentional-design-w-dan-balsam-tom-mcgrath/
"Against Almost Every Theory of Impact of Interpretability" (Charbel-Raphael, 2023) -- The steelman case against the entire field Goodfire operates in. https://www.lesswrong.com/posts/LNA8mubrByG7SFacm/against-almost-every-theory-of-impact-of-interpretability-1
"Intentionally Designing the Future of AI" (Tom McGrath, Feb 2026) -- Core manifesto. Best articulation of what Goodfire is actually trying to do and why. https://www.goodfire.ai/blog/intentional-design
"Goodfire and Training on Interpretability" (LW, Feb 2026) -- Direct community critique with the strongest specific objections. https://www.lesswrong.com/posts/B3DQvjCD6gp2JEKaY/goodfire-and-training-on-interpretability
Eric Ho on Sequoia's Training Data -- CEO's most polished interview: steam engine analogy, drug biochemistry analogy, and the boldest predictions. https://sequoiacap.com/podcast/training-data-eric-ho/