Goodfire

Research

Commercial interpretability. Sparse autoencoders.

Founded: 2024
HQ: San Francisco, CA
Team: 51
Structure: PBC
Model: VC Investment

Theory of Change

Goodfire's theory of change has three layers, evolved over its short existence:

Layer 1 -- Understanding (original, 2024): Build tools to "open the black box" of neural networks. Use mechanistic interpretability (SAEs, parameter decomposition) to extract interpretable features and circuits, then sell these tools to enterprises and researchers.

Layer 2 -- Intentional Design (pivoted to, Feb 2026): Use interpretability not just to inspect trained models but to shape training itself. Tom McGrath's manifesto: "The aim of intentional design is to use interpretability tools to shape training by sculpting learning from each individual datapoint." The key principle is "don't fight backprop" -- rather than pushing against gradient descent, reshape the loss landscape so the model naturally learns what you want. Their analogy: moving from selective breeding to genetic engineering.

Layer 3 -- Scientific Abundance: Extract knowledge from "narrowly superhuman" AI models to advance human science. Apply interpretability to biological foundation models, materials science models, etc. as a "microscope for understanding what the models have learned."

Eric Ho (CEO): "Interpretability, for us, is the toolset for a new domain of science: a way to form hypotheses, run experiments, and ultimately design intelligence rather than stumbling into it."

What They Do

Key technical result: RLFR (Reinforcement Learning from Feature Rewards) -- used lightweight probes on a frozen copy of a model to provide reward signals for RL training, reducing hallucinations in Gemma-3-12B-IT by 58% at ~90x lower cost than LLM-as-judge. Critically, the probe remained useful for monitoring after training, suggesting the process didn't undermine interpretability. This is their first proof-of-concept for intentional design.
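The probe-as-reward idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not Goodfire's implementation: the activation vectors, probe weights, and function names below are all hypothetical. A real setup would read activations from the frozen model copy and pass the reward to an RL algorithm.

```python
import math
from typing import Sequence


def probe_probability(activations: Sequence[float],
                      weights: Sequence[float],
                      bias: float) -> float:
    """Logistic probe: estimated probability that the unwanted feature is present."""
    logit = bias + sum(w * a for w, a in zip(weights, activations))
    return 1.0 / (1.0 + math.exp(-logit))


def feature_reward(activations: Sequence[float],
                   weights: Sequence[float],
                   bias: float) -> float:
    """Scalar reward: high when the probe sees little evidence of the feature."""
    return 1.0 - probe_probability(activations, weights, bias)


# Hypothetical probe parameters. In this sketch they are fixed constants,
# standing in for a probe fit once on labeled data and never updated.
PROBE_WEIGHTS = [2.0, -1.0]
PROBE_BIAS = 0.0

# Hypothetical activations that a frozen model copy might produce for two
# candidate responses (made-up numbers, for illustration only).
grounded_acts = [0.1, 0.9]
hallucinated_acts = [1.5, 0.2]

reward_grounded = feature_reward(grounded_acts, PROBE_WEIGHTS, PROBE_BIAS)
reward_hallucinated = feature_reward(hallucinated_acts, PROBE_WEIGHTS, PROBE_BIAS)

print(f"grounded reward:     {reward_grounded:.3f}")
print(f"hallucinated reward: {reward_hallucinated:.3f}")
```

In this sketch the probe parameters never change and the activations are treated as coming from a frozen copy, so the policy being trained has no gradient path into the probe. That is the structural property the frozen-copy argument relies on. Whether it holds up in practice is the open question discussed elsewhere in this profile.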

Product: Ember platform -- the first hosted mechanistic interpretability API (launched Dec 2024). Supports Llama 3.3 70B and 3.1 8B. Capabilities include AutoSteer (automatic feature steering from natural language), feature search, contrastive search. Public API was deprecated in Feb 2026; now operates as a partner-deployed platform.

Customers: Rakuten (PII detection, 44M+ monthly queries), Microsoft, Mayo Clinic (genomic medicine), Arc Institute (Evo 2 genomics), Prima Mente (Alzheimer's biomarkers), Radical AI (materials science).

Scientific discovery: Applied interpretability to Prima Mente's epigenetic model and discovered cell-free DNA fragment length as a novel Alzheimer's biomarker -- previously unstudied in Alzheimer's literature. Dan Balsam: "one of the first examples of learning something new from a model by studying it."

Open-source contributions: Released SAEs for Llama 70B and 8B. Published SDK on GitHub. Ran hackathon with Apart Research (200+ researchers, 15 countries). Fellowship program for interp researchers.

Research pipeline: Stochastic Parameter Decomposition (SPD) led by Lee Sharkey as an alternative to activation-space methods. Circuit tracing replications. Infrastructure scaling to trillion-parameter models (Kimi K2).

Key People

Eric Ho (CEO, co-founder): Yale '16. Previously co-founded RippleMatch (AI recruiting, ~$80M raised, Forbes 30 Under 30). Left in 2023 citing AI risk concerns. No prior interpretability background. Startup operator who found a science-driven co-founder. Predicts neural nets "fully decoded by 2028." Self-described "AGI pilled."

Tom McGrath (Chief Scientist, co-founder): Founded DeepMind interpretability team. Senior Research Scientist at DeepMind 2019-2024. Author of the intentional design manifesto. More measured than Ho -- explicitly warns intentional design "probably shouldn't be used on frontier models today" and calls paranoia "a way of life."

Lee Sharkey (Principal Investigator): Co-founded Apollo Research (safety evals), previously at Conjecture. Pioneer of SAEs for language models. Now leading SPD/APD research. His move from a pure safety org (Apollo) to a commercial interpretability startup is a data point about where top researchers see leverage.

Team: ~51 employees as of Jan 2026. ML Engineer salary $200K-$400K. 5-day in-office (SF + NYC).

Money and Incentives

Funding: $207M total, entirely VC-funded.

Seed: $7M (Aug 2024), Lightspeed lead
Series A: $50M at $200M valuation (Apr 2025), Menlo Ventures lead; Anthropic invested $1M (its first-ever startup investment)
Series B: $150M at $1.25B valuation (Feb 2026), B Capital lead; DFJ Growth, Salesforce Ventures, Eric Schmidt
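The round figures can be cross-checked with simple arithmetic. The sketch below uses only the amounts and valuations listed above; the variable names are illustrative.

```python
# Round sizes and valuations in millions of USD, as listed above.
round_sizes_musd = {"Seed": 7, "Series A": 50, "Series B": 150}
valuations_musd = {"Series A": 200, "Series B": 1250}

# Sum of all disclosed rounds.
total_raised_musd = sum(round_sizes_musd.values())

# Multiple between the Series A and Series B valuations.
valuation_step_up = valuations_musd["Series B"] / valuations_musd["Series A"]

print(f"Total raised: ${total_raised_musd}M")
print(f"Series A to Series B valuation step-up: {valuation_step_up:.2f}x")
```

The total matches the $207M headline figure, and the valuation multiple between the Apr 2025 and Feb 2026 rounds is 6.25x.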

Revenue: Not disclosed. Usage-based API pricing ($0.35-$1.90 per million tokens) has given way to a partner-deployed platform. "Tokens nearly tripling monthly" (Contrary, Aug 2025).

Estimated burn: With ~51 employees at $180K-$400K each, payroll alone comes to roughly $9M-$20M a year; with compute and overhead on top, annual burn is likely $20M-$40M+. Runway is comfortable at that rate, but VC investors expect aggressive growth.
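The burn estimate can be unpacked as back-of-the-envelope arithmetic. The sketch below rests on two loud assumptions that are not in the source: that the full $207M raised is still on hand, and that burn stays flat. The first overstates the cash available, so the runway figures are upper bounds, not forecasts.

```python
HEADCOUNT = 51
SALARY_LOW, SALARY_HIGH = 180_000, 400_000      # per-employee range from the text
BURN_LOW, BURN_HIGH = 20_000_000, 40_000_000    # annual burn estimate from the text
TOTAL_RAISED = 207_000_000                      # ASSUMPTION: all of it still unspent

# Payroll implied by headcount and the salary range.
payroll_low = HEADCOUNT * SALARY_LOW
payroll_high = HEADCOUNT * SALARY_HIGH

# Ratio of estimated burn to payroll: what the estimate implies for
# compute, benefits, and other overhead on top of salaries.
overhead_multiple_low = BURN_LOW / payroll_low
overhead_multiple_high = BURN_HIGH / payroll_high

# Runway bounds under the flat-burn, nothing-spent assumption.
runway_years_at_low_burn = TOTAL_RAISED / BURN_LOW
runway_years_at_high_burn = TOTAL_RAISED / BURN_HIGH

print(f"Payroll: ${payroll_low / 1e6:.1f}M to ${payroll_high / 1e6:.1f}M per year")
print(f"Burn / payroll: {overhead_multiple_high:.2f}x to {overhead_multiple_low:.2f}x")
print(f"Runway: {runway_years_at_high_burn:.1f} to {runway_years_at_low_burn:.1f} years")
```

On these assumptions the burn estimate is roughly double payroll, and runway sits somewhere between about five and ten years, which is consistent with "comfortable."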

Business model: PBC (Delaware). Usage-based API shifted to selective enterprise partnerships. Compute-bound cost structure.

Incentive tensions:

VC investors who entered at a $1.25B valuation expect 10x+ returns. Scientific Abundance (pharma/biotech applications) has the clearest revenue path. Pure safety research (Understanding) does not generate revenue directly.
Anthropic's $1M investment creates a structural dynamic: Anthropic endorses Goodfire publicly, runs the world's largest in-house interpretability team, and benefits as investor regardless of outcome. If Anthropic's team supersedes Goodfire's tools, the company's safety rationale weakens.
Investor base is commercial (B Capital, Salesforce Ventures, Eric Schmidt, Lightspeed), not impact-oriented. They will prioritize revenue growth.
PBC structure is the minimum viable safety governance -- it allows prioritizing benefit but does not require it.
No philanthropic funding means zero external accountability to safety community expectations.

What Others Say

"Most Forbidden Technique" criticism: Zvi Mowshowitz warned against training on interpretability because it trains models to evade interpretability monitors. Satya Benson applied this directly to Goodfire (LW, 32 karma): "This seems like an instance of 'The Most Forbidden Technique.'" The strongest commenter (JBlack): "the gradient-decomposition approach only works if interpretability is complete and accurate -- which it isn't." Goodfire's response: the frozen-copy probe approach structurally avoids the problem because the model can't backpropagate through the probe.

Independent steering evaluation: An external researcher benchmarked AutoSteer against prompting on Llama 8B/70B ("Mind the Coherence Gap"). Finding: "prompting remains the cheapest, most reliable control knob." Standalone steering drops coherence by ~0.6 points. (The author notes that the SDK has been updated since.)

Defense from researchers: An LW post defends the research direction: "It Is Reasonable To Research How To Use Model Internals In Training." Notes this is "a pretty normal area of interpretability research" with work from Anthropic Fellows, FAR, and others. Steven Byrnes provides a contingent defense based on brain-like reward systems where "the loop doesn't close."

Comprehensive interpretability critique: Charbel-Raphael's 12K-word takedown of nearly all interpretability theories of impact ("Against Almost Every Theory of Impact of Interpretability," 2023). Argues that interpretability is unlikely to help with deception, that enumerative safety is doomed, and that the field may be net harmful through dual use and a false sense of control.

OpenAI's obfuscated reward hacking: Directly relevant precedent. Penalizing "bad thoughts" in chain-of-thought led to agents that still reward-hack but hide intent. McGrath explicitly addresses this, acknowledging the concern while arguing the frozen-copy approach is structurally different.

What's Absent

Board composition: Completely opaque. No public list of board members, investor board seats, or safety-focused governance. For a $1.25B PBC whose mission centers on safety, this is a striking omission.

Revenue and unit economics: No public revenue figures for a unicorn-valued company.

Safety policy for intentional design: Only a 362-word blog post on safety measures (feature moderation, I/O filtering). No policy on when intentional design techniques are safe for frontier models, who decides, or what safeguards exist.

Independent validation: The RLFR hallucination result (58% reduction) has no external replication. The only independent steering evaluation found that prompting outperformed feature steering.

Comparison with Anthropic's interp team: No direct comparison exists between Goodfire's tools and Anthropic's in-house work (circuit tracing, attribution graphs, constitutional classifiers).

Publication policy: No stated policy for what research stays proprietary vs. gets published.

Stated Theory of Change

Goodfire's stated theory of change has evolved significantly in its short existence and is now centered on intentional design: using interpretability tools to shape what AI models learn during training, not just inspect them afterward.

The causal chain: (1) Develop interpretability tools that extract meaningful semantic features from model internals --> (2) Use those features to decompose what a model is learning from each training example --> (3) Intervene on the training process to selectively promote or suppress specific learned behaviors --> (4) This produces models that are more aligned, less prone to hallucination, and more controllable --> (5) Simultaneously, use interpretability as a "microscope" to extract scientific knowledge from superhuman models.

Tom McGrath frames this as moving from selective breeding to genetic engineering for AI. The key principle is "don't fight backprop" -- rather than trying to override gradient descent, reshape the loss landscape so the model naturally learns what you want.

The safety-specific theory of change is: if we can understand and steer what models learn, we can build alignment into the training process rather than bolting it on afterward. This is complemented by the "train/test split" proposal -- use one set of interpretability techniques during training, a different set for auditing.

Revealed Theory of Change

Goodfire's actions reveal a more commercially pragmatic trajectory than the safety-first framing suggests:

Research evolution follows commercial viability. The company started with pure interpretability (SAEs, feature extraction) -- a safety tool. It evolved toward intentional design (training-time intervention) -- which is both a safety tool AND a capabilities accelerator. And it's simultaneously pursuing scientific abundance (knowledge extraction from models) -- which is primarily a commercial application with the strongest revenue path. The research trajectory moves steadily toward what generates enterprise value.

Customer portfolio reveals priorities. Rakuten (PII detection), Mayo Clinic (genomic medicine), Arc Institute (genomics), Prima Mente (Alzheimer's). These are commercial partnerships, not safety deployments. The Alzheimer's biomarker discovery is scientifically exciting but has no direct connection to reducing AI catastrophic risk.

The pivot from public API to partner-deployed platform concentrates access among paying enterprise customers and away from the broader safety research community. Open-source SAE releases partially offset this.

Funding trajectory drives toward commercial value. Seed-to-unicorn in 18 months with commercial VC investors (B Capital, Salesforce Ventures, Eric Schmidt) who expect returns. The board composition is opaque, but commercial investors dominate.

The revealed theory of change is: build a commercially successful interpretability platform that generates revenue from enterprise customers, while conducting foundational safety-relevant research that maintains credibility with the safety community and attracts top researchers. Safety research is a recruitment tool, a brand differentiator, and a genuine passion of the founders -- but it is not the revenue driver.

Key Assumptions

1. Interpretability will scale to frontier models.

Evidence for: Progress from toy models to 70B and 671B parameter models. SAEs, SPD, circuit tracing all scaling up. Kimi K2 (trillion-param) activation harvesting demonstrated.
Evidence against: No evidence that current techniques capture the computationally important aspects of frontier reasoning. The "Mind the Coherence Gap" evaluation found prompting outperforms feature steering. The 2023 critique notes: "Properties of models which are dangerous are not low-level features, but high-level behavioral abilities."
Testable: Yes -- can Goodfire's tools detect or prevent a known failure mode in a frontier model that evals miss?
If wrong: Goodfire becomes a niche scientific tool vendor, not a safety infrastructure company.

2. Intentional design won't undermine interpretability as a "test set."

Evidence for: RLFR result shows probes remain valid after training on a frozen copy. The frozen-copy approach structurally avoids direct gradient flow through the probe.
Evidence against: This is one result on one model on one task. OpenAI's obfuscated reward hacking paper shows models can learn to hide intent. JBlack's critique: "gradient-decomposition approach only works if interpretability is complete and accurate -- which it isn't."
Testable: Yes -- do probes remain valid after many rounds of intentional design, on diverse tasks, at frontier scale?
If wrong: Goodfire's core innovation is net-negative for safety, actively degrading the interpretability tools that others rely on for auditing.

3. A VC-funded PBC can sustain safety research under commercial pressure.

Evidence for: Ho's genuine concern about AI risk (left a successful company). McGrath's research career dedicated to this. PBC structure. Anthropic endorsement.
Evidence against: $1.25B valuation requires revenue growth. Investor base is entirely commercial. Board composition is opaque. PBC governance is legally weak. No ring-fenced safety research budget. Historical precedent (OpenAI's transition from nonprofit to capped-profit) shows mission drift.
Testable: Yes -- does the ratio of safety research to commercial work change over time?
If wrong: Goodfire becomes a commercial AI tooling company that maintains a safety veneer.

4. Third-party interpretability companies add value beyond in-house lab teams.

Evidence for: Cross-model perspective, independence from any single lab's incentives, ability to serve multiple customers. McGrath argues for the value of independent research.
Evidence against: Anthropic runs the world's largest interpretability team. Google DeepMind has significant interp efforts. Labs have access to their own models' full training data, compute, and architecture details that Goodfire doesn't.
Testable: Does Goodfire's research output match or exceed what Anthropic's team produces? Do customers choose Goodfire over building in-house?
If wrong: Goodfire's tools become redundant as major labs build superior in-house alternatives.

Strengths

World-class research team. Tom McGrath (founded DeepMind interp team), Lee Sharkey (SAE pioneer, Apollo co-founder), Nick Cammarata (OpenAI interp co-founder). This is arguably the strongest independent interpretability team outside the major labs.

Intellectually honest leadership. McGrath's intentional design manifesto is remarkably candid about risks, limitations, and open problems. He explicitly says the techniques "probably shouldn't be used on frontier models today." This is not the language of a company overselling its product.

Concrete empirical results. The RLFR hallucination reduction (58%, with probe transfer) is a real result, not just a vision document. The Alzheimer's biomarker discovery is genuine science. These are not trivially dismissable.

Unique positioning. Only commercial company with a serious argument for how interpretability translates to training-time alignment improvement. The intentional design vision is genuinely novel and potentially transformative if it works.

Speed of execution. Seed to unicorn in 18 months. From founding to published research results on hallucination reduction and scientific discovery within 20 months. This is exceptionally fast for a research-first company.

Weaknesses and Risks

The "Most Forbidden Technique" problem is real and underaddressed. The frozen-copy probe approach is a clever technical response, but it's one data point on one model. The fundamental concern -- that optimizing against interpretability tools degrades those tools -- has not been resolved. The train/test split proposal is "currently hypothetical." If intentional design goes wrong, it could actively harm the broader safety ecosystem by degrading interpretability's value as an auditing tool.

Commercial pressure will intensify. $1.25B valuation with commercial VCs demands revenue growth. The three pillars have very different commercial profiles: Scientific Abundance (pharma clients) generates revenue, Intentional Design (model developers) generates revenue, Understanding (pure research) does not. Expect the balance to shift toward commercial applications.

Governance is dangerously opaque. No public board composition. No detailed safety policy for intentional design. No external safety advisory board. No stated criteria for when these techniques are safe for frontier models. PBC structure is legally weak. This is inadequate for a company developing training-time AI intervention capabilities.

The single independent evaluation found that prompting outperforms steering. The "Mind the Coherence Gap" evaluation is limited but directionally concerning. If feature steering doesn't reliably outperform prompting for behavior control, a large part of the product value proposition is in question.

Key claims lack independent validation. The RLFR hallucination result has no external replication. The intentional design algorithms are not fully published. The safety community cannot independently verify whether these techniques are net-positive or net-negative.

Potential for capabilities acceleration. Interpretability tools can be used for capabilities (optimizing models, reducing compute costs, improving training efficiency) as easily as for safety. Goodfire's tools in the hands of customers could accelerate capabilities without commensurate safety benefit.

Cross-References

Anthropic: Both investor and most relevant competitor. Anthropic's in-house team (Chris Olah, circuit tracing, attribution graphs) is the benchmark for interpretability research. Goodfire's independence is its value proposition -- cross-model, cross-architecture perspective. But if Anthropic's tools are superior for their own models, Goodfire's market narrows to non-Anthropic models.

Apollo Research: Lee Sharkey's former org. Apollo does safety evaluations (behavioral evals, scheming detection). Goodfire does interpretability-based understanding and intervention. Complementary rather than competing. Sharkey's move from Apollo to Goodfire suggests he sees more leverage in interpretability tools than in behavioral evals.

Timaeus: Focused on understanding learning dynamics ("developmental interpretability"). Goodfire cites their patterning work as inspiration. Complementary research direction. Timaeus is academic/nonprofit; Goodfire is commercial.

MIRI, ARC (Paul Christiano): Focused on theoretical alignment. Goodfire is more empirically driven. Different bets on what approach will work.

Transluce/Tilde/Wisent: Smaller competitors in commercial interpretability. Tilde ($8M from Khosla) and Wisent are early-stage. Goodfire has significant lead in funding, team, and research output.

What Would Change This Assessment

Upward:

Independent replication of RLFR results at frontier scale with diverse tasks
Publication of detailed safety policy with clear criteria for frontier deployment
Transparent board composition with safety-focused members
Evidence that intentional design techniques catch failure modes that behavioral evals miss
Revenue growth that doesn't come at the expense of safety research investment

Downward:

Safety researchers depart citing commercial pressure
Intentional design techniques shown to degrade probe accuracy across models (not just frozen copy)
Board composition reveals no safety-focused members
Research output shifts predominantly toward commercial applications
Major lab (Anthropic, DeepMind) builds superior in-house tools, reducing Goodfire to niche player

Self-Critique

Potential bias toward skepticism: I may be overweighting the structural criticisms (VC incentives, governance opacity) relative to the genuine technical accomplishments. The RLFR result is real. The Alzheimer's discovery is real. The team quality is extraordinary. A purely structural critique risks missing that this might simply be the most effective way to advance interpretability research given current institutional constraints.

Missing information that would change my view: Revenue data. Board composition. A detailed safety policy for intentional design. Any of these could significantly shift the assessment. The opaque governance could hide strong safeguards -- or weak ones. I don't know, and the absence of information is itself a finding.

Weakest claim: That the VC funding model will inevitably subordinate safety to commercial pressure. There are counterexamples (Anthropic itself is VC-funded with safety commitments). The PBC structure, while legally weak, does provide some protection. And the founders' commitment to safety is well supported by the record. My structural concern may be overweighted relative to the founders' actual intentions.

What a thoughtful disagreer would say: "You're applying nonprofit/EA standards to a for-profit company. Goodfire's approach -- making safety commercially viable -- is exactly how you scale safety research beyond the narrow EA funding base. The alternative (waiting for philanthropy to fund all safety work) doesn't scale. VC funding attracts talent that philanthropy can't. And the commercial customers are actually deploying interpretability in real systems, not just publishing papers."

Sources I should have checked but couldn't: DeepMind's SAE deprioritization article (failed to fetch). Haize Labs red-teaming blog (failed to fetch). Full text of OpenAI's obfuscated reward hacking paper. Any private communication between Anthropic and Goodfire about the investment terms.

Connected to (11)

MATS: collaborator · Lee Sharkey
Timaeus: collaborator
Anthropic: collaborator
Apollo Research: staff from · Lee Sharkey
Arc Institute: collaborator
Mayo Clinic: collaborator
Apart Research: collaborator
Conjecture: staff from · Lee Sharkey
Google DeepMind: staff from · Tom McGrath
Haize Labs: collaborator
OpenAI: advisor at · Nick Cammarata

Sources (58)

Every URL that was read during research.

1. Company (goodfire.ai)
2. Understanding, Learning From, and Designing AI: Our Series B (goodfire.ai)
3. Announcing Our $50M Series A to Advance AI Interpretability Research (goodfire.ai)
4. Intentionally Designing the Future of AI (goodfire.ai)
5. On Optimism for Interpretability (goodfire.ai)
6. Goodfire Ember: Scaling Interpretability for Frontier Model Alignment (goodfire.ai)
7. Using Interpretability to Identify a Novel Class of Alzheimer's Biomarkers (goodfire.ai)
8. Towards Scalable Parameter Decomposition (goodfire.ai)
9. Features as Rewards: Using Interpretability to Reduce Hallucinations (goodfire.ai)
10. Mapping the Latent Space of Llama 3.3 70B (goodfire.ai)
11. Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B (goodfire.ai)
12. Report: Goodfire Business Breakdown & Founding Story | Contrary Research (research.contrary.com)
13. Goodfire’s Eric Ho on Mapping the Mind of a Neural Net (sequoiacap.com)
14. Popular Mechanistic Interpretability: Goodfire Lights the Way to AI Safety (cognitiverevolution.ai)
15. Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath (cognitiverevolution.ai)
16. Untangling Neural Network Mechanisms: Goodfire's Lee Sharkey on Parameter-based Interpretability (cognitiverevolution.ai)
17. The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI (latent.space)
18. Goodfire: Building Interpretable AI (lsvp.com)
19. Leading Goodfire’s $50M Series A to Interpret How AI Models Think | Menlo Ventures (menlovc.com)
20. Anthropic backs Goodfire in $50M Series A funding to decode AI models, marking first-ever startup investment (techstartups.com)
21. 41 - Lee Sharkey on Attribution-based Parameter Decomposition (axrp.net)
22. Goodfire raises $150M in funding to enhance its AI interpretability platform - SiliconANGLE (siliconangle.com)
23. AI Lab Goodfire Raises $150M at $1.25B Valuation to Design Models with Interpretability (prnewswire.com)
24. Mapping the Mind of a Neural Net: Goodfire's Eric Ho on the Future of Interpretability (inferencebysequoia.substack.com)
25. Work (tommcgrath.github.io)
26. "[Question] Goodfire and Training on Interpretability" (lesswrong.com)
27. Mind the Coherence Gap: Lessons from Steering Llama with Goodfire (greaterwrong.com)
28. Behind the Features: Goodfire's Interpretability Tools in Action | Apart Research (apartresearch.com)
29. The Living Edge Spotlight #1: Mechanistic Interpretability (thelivingedge.substack.com)
30. Careers (goodfire.ai)
31. Research (goodfire.ai)
32. Can startups be impactful in AI safety? (greaterwrong.com)
33. This startup wants to reprogram the mind of AI—and just got $50 million to do it (fastcompanyme.com)
34. GitHub - goodfire-ai/goodfire-sdk: Ember is a hosted API/SDK that lets you shape AI model behavior by directly controlling a model's internal units of computation, or "features". With Ember, you can modify features to precisely control model outputs, or use them as building blocks for tasks like classification. (github.com)
35. Goodfire Lands $50M From Anthropic & Silicon Valley Giants To Open The Black Box Of AI — Before It Rewrites Us First (finance.yahoo.com)
36. Goodfire Raises $50M Series A to Advance AI Interpretability Research (prnewswire.com)
37. The Sequence AI of the Week #805: Goodfire and the Era of AI Interpretability (thesequence.substack.com)
38. Your Showcase Primer: TypeSafe AI, Fleet AI, Goodfire (newsletter.foundersysk.com)
39. Lee Sharkey at MATS: Summer 2026 (matsprogram.org)
40. Goodfire: $150 Million Series B At $1.25 Billion Valuation Raised For Interpretability AI Lab (pulse2.com)
41. Goodfire Announces Collaboration to Advance Genomic Medicine with AI Interpretability (goodfire.ai)
42. Goodfire Raises $150M Series B to Advance the Frontier of AI Interpretability (workbench.substack.com)
43. Some for-profit AI alignment org ideas (greaterwrong.com)
44. Goodfire - Companies - South Park Commons (southparkcommons.com)
45. Our Approach to Safety at Goodfire (goodfire.ai)
46. Feature Steering for Reliable and Expressive AI Engineering (goodfire.ai)
47. Under the Hood of a Reasoning Model (goodfire.ai)
48. Painting With Concepts Using Diffusion Model Latents (goodfire.ai)
49. You and Your Research Agent: Lessons From Using Agents for Interpretability Research (goodfire.ai)
50. How Rakuten secures reliable AI experiences for 44M+ monthly users (goodfire.ai)
51. Customer Stories (goodfire.ai)
52. It Is Reasonable To Research How To Use Model Internals In Training (greaterwrong.com)
53. In (highly contingent!) defense of interpretability-in-the-loop ML training (greaterwrong.com)
54. Against Almost Every Theory of Impact of Interpretability (greaterwrong.com)
55. Blog (goodfire.ai)
56. Interpretability Infrastructure at Frontier Scale: Harvesting Activations from a Trillion-Parameter Model (goodfire.ai)
57. Announcing Goodfire’s Fellowship Program for Interpretability Research (goodfire.ai)
58. Replicating Circuit Tracing for a Simple Known Mechanism (goodfire.ai)