Theory of Change
Transluce's theory of change is that AI safety requires independent, third-party oversight -- companies cannot be "the primary arbiters of their safety, due to the conflict of interest with commercial priorities." Transluce builds open, scalable tools for auditing AI systems and makes them available to external evaluators, researchers, and governments.
The causal chain: (1) Build automated interpretability tools that use AI to understand AI -- specialized smaller models can audit larger ones. (2) Open-source these tools so anyone can run audits, not just the labs. (3) Conduct public audits that demonstrate model problems and create pressure for improvement. (4) Build institutional infrastructure (AI Evaluator Forum, AEF-1 standard) so third-party evaluation becomes standard practice. (5) The resulting ecosystem of independent evaluators provides a check on labs that internal safety teams cannot.
Jacob Steinhardt's framing: AI systems are complex adaptive systems, comparable to ecosystems or financial markets, where "straightforward attempts to control their behavior lead to unintended consequences." Safety requires diverse, overlapping approaches -- measurement infrastructure built now enables action later, analogous to how climate measurement laid the groundwork for climate policy.
What They Do
Public auditing: Transluce's most visible work is investigating and publicizing model problems. Their o3 truthfulness investigation (April 2025) found that OpenAI's reasoning model "frequently fabricates actions it took to fulfill user requests, and elaborately justifies the fabrications when confronted." O-series models hallucinate at 2-4x the rate of GPT-series models. The investigation drew TechCrunch coverage, and its hypothesized causes (outcome-based RL, discarded chains of thought) informed public understanding. Separately, their PRBO method found self-harm encouragement, conspiracy theories, and other pathological behaviors in Qwen, Llama, and DeepSeek models.
Docent platform: Adopted by 25+ organizations including Anthropic (used for Claude 4 pre-deployment safety analysis), DeepMind, METR, Redwood, Apollo, Palisade, Princeton, and Bridgewater. Docent supports analysis of AI agent transcripts, surfacing corrupted tasks, unintended solutions (e.g., an agent reading a flag from a Dockerfile instead of solving a challenge), and unexpected behaviors. This is probably their most impactful product to date.
Investigator agents: Trained models that automatically elicit behaviors from target systems. An 8B-parameter investigator jailbreaks GPT-5 (28% success in thinking mode), Claude Opus 4.1, and Gemini 2.5 Pro. Reported results include a 98.1% string elicitation rate and a 65.8% transfer rate to GPT-4o from a Llama-8B investigator.
Automated neuron descriptions: Described all 458,752 MLP neurons in Llama-3.1-8B at $0.046/neuron, with description quality reported to exceed that of human-written descriptions. 12,000+ downloads.
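The neuron-description figures above imply a total cost that the text does not state. A minimal sketch of the arithmetic follows. The neuron count and per-neuron cost come from the text; the layer count and MLP width are assumptions taken from the publicly released Llama-3.1-8B configuration, and the function name is hypothetical.

```python
# Figures quoted in the text.
NEURON_COUNT = 458_752
COST_PER_NEURON_USD = 0.046

# Assumption (not in the text): 32 transformer layers, each with an
# MLP hidden width of 14,336, per the public Llama-3.1-8B config.
NUM_LAYERS = 32
MLP_WIDTH = 14_336


def total_description_cost(neurons: int, cost_per_neuron: float) -> float:
    """Return the implied total cost in USD of describing every neuron."""
    return neurons * cost_per_neuron


# The quoted neuron count should equal layers x MLP width.
count_matches_config = NUM_LAYERS * MLP_WIDTH == NEURON_COUNT

total_usd = total_description_cost(NEURON_COUNT, COST_PER_NEURON_USD)
print(f"Count matches config: {count_matches_config}")
print(f"Implied total cost: ${total_usd:,.0f}")
```

Under these assumptions the full pass over the model costs roughly $21,000, which is the sense in which the approach is cheap enough to run exhaustively rather than on a sample of neurons.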
Predictive Concept Decoders (PCDs): Encoder-decoder architecture for interpretability, outperforming LatentQA for probing model representations.
Monitor: Open-source observability interface for exploring neural network internals with activation/attribution modes.
Field-building: Co-founded the AI Evaluator Forum at NeurIPS (December 2025) with METR, RAND, SecureBio, Princeton HAL, CIP, Meridian Labs, and AVERI. Released the AEF-1 standard for third-party evaluation independence. MATS mentoring (Summer 2026).
Output trajectory since October 2024 launch: roughly one major publication or product per month. 10 open-source repositories on GitHub.
Key People
Jacob Steinhardt (CEO, co-founder): On leave from UC Berkeley (Statistics/EECS). PhD Stanford under Percy Liang. Postdoc at OpenAI. Co-author of "Concrete Problems in AI Safety" (with Dario Amodei, Chris Olah, Paul Christiano). Co-author of the MMLU benchmark. Technical advisor to Open Philanthropy. Co-founded SPARC (2012) with Paul Christiano, Andrew Critch, Anna Salamon. AI2050 Fellow (Schmidt Sciences). Has been working on AI safety since 2011. Describes himself as a "worried optimist." Deep intellectual investment in measurement, forecasting, and the complex systems perspective on AI safety.
Sarah Schwettmann (Chief Scientist, co-founder): Cognitive neuroscientist. PhD MIT (Brain and Cognitive Sciences, Tenenbaum/Torralba). Built MAIA -- the first large-scale automated interpretability agent pipeline at MIT CSAIL. Her neuroscience-to-interpretability pathway is distinctive. No long-form public interview about Transluce's vision found.
Conrad Stosz (Head of Governance): Former director of AI at White House OMB. Former Acting Director of U.S. Center for AI Standards and Innovation. Joined October 2025. Signals serious policy engagement.
Team: ~20 people. Founding team includes Neil Chowdhury (ex-OpenAI safety, GPT-4o/o1) and Kevin Meng (MIT PhD, created ROME/MEMIT model editing).
Money and Incentives
Budget: $11M fundraising target for the current giving season. No total historical budget disclosed.
Revenue split: ~80% philanthropic, ~20% earned revenue from private companies and governments. Transluce projects that the earned-revenue share will increase over time.
Revenue model: Open core. Core oversight tools are open source; hosted services and advanced features generate revenue.
Budget allocation: 60% scaling existing research, 15% evaluating model releases, 10% governance and public accountability, 10% new research, 5% overhead.
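To make the percentages above concrete, here is a minimal sketch that converts them into dollar amounts. It assumes the stated shares apply to the $11M fundraising target, which the text does not confirm; the function and variable names are hypothetical.

```python
# Assumption: the allocation percentages apply to the $11M fundraising
# target quoted in the Budget line. The source does not state the base.
TARGET_USD = 11_000_000

# Percentages as quoted in the text.
ALLOCATION_PCT = {
    "scaling existing research": 60,
    "evaluating model releases": 15,
    "governance and public accountability": 10,
    "new research": 10,
    "overhead": 5,
}


def allocate(total_usd: int, shares_pct: dict[str, int]) -> dict[str, int]:
    """Split total_usd by whole-number percentages that must sum to 100."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("allocation percentages must sum to 100")
    return {name: total_usd * pct // 100 for name, pct in shares_pct.items()}


amounts = allocate(TARGET_USD, ALLOCATION_PCT)
for name, usd in amounts.items():
    print(f"{name:<40} ${usd:>10,}")
```

On that assumption, about $6.6M would go to scaling existing research and about $1.1M each to governance and to new research. If the actual budget differs from the fundraising target, the dollar amounts scale proportionally.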
Named donors: None identified. This is the biggest financial information gap. We cannot assess donor concentration, lab dependencies, or potential conflicts of interest without knowing who funds Transluce.
Pre-Transluce grants to Steinhardt: $156K from Coefficient Giving/Open Philanthropy (2018-2022) -- all to Steinhardt personally or UC Berkeley, not to Transluce as an entity. Additional grants through Schmidt Sciences (AI2050 Fellowship) and other academic sources.
No 990 data: Organization is too new for IRS filings to be publicly available. EIN unknown.
Structural comparison with Goodfire: Transluce operates on roughly 1/20th of Goodfire's resources (~$11M fundraising target, mostly philanthropic, vs. $207M VC). No equity investors, no valuation pressure, no board members seeking returns. But also: no ability to pay $200-400K salaries, no massive compute budget, and a likely dependency on a small number of philanthropic funders.
Incentive analysis: The nonprofit structure protects against the commercial pressure that threatens for-profit interpretability companies (Goodfire's $1.25B valuation demands revenue growth). But the 80/20 philanthropic/earned split creates its own dependency. If a small number of funders (potentially including AI labs that Transluce audits) provide most of the 80% philanthropic portion, independence could be compromised without any visible pressure.
Steinhardt's role as Open Philanthropy technical advisor is a potential conflict of interest -- he advises the largest funder in AI safety while leading an organization that likely depends on similar funding sources. No public recusal policy exists.
What Others Say
Neel Nanda (Google DeepMind, mechanistic interpretability lead), in a 30K-word 80,000 Hours interview: "My most ambitious vision of mech interp is probably dead." In his view, full model understanding is not achievable. But: "I've gone from there's a low chance this is an incredibly big deal to there's a high chance this is a medium big deal." He advocates a "swiss cheese model" of safety -- multiple imperfect layers -- and reports that simple probes work surprisingly well for detecting harmful intent. He notes that chain-of-thought monitoring is "kind of wild" in how useful it is, but estimates it will be useful for "maybe one to three years, but not indefinitely."
This assessment is important for Transluce because: (a) Transluce's practical tool-building approach aligns better with Nanda's revised pragmatic vision than with the original ambitious mech interp vision; (b) Docent is essentially CoT monitoring for agents, and Nanda's timeline gives it 1-3 years of relevance; (c) Nanda's endorsement of probes supports Transluce's PCD and decoder-based approaches.
Charbel-Raphael (comprehensive 11K-word critique): Argues against nearly every theory of impact for interpretability. "Auditing deception with interp is out of reach." Enumerative safety is "doomed from the start." He argues that interpretability is primarily post-hoc, not predictive, and that resources are better invested in other approaches. Neel Nanda partially agreed with this critique.
This matters because most of these critiques target the "fully understand the model" vision. Transluce's approach -- build practical auditing tools, find model problems, publish them -- is a narrower claim that partially sidesteps these critiques.
Dario Amodei (Anthropic CEO): "We are in a race between interpretability and model intelligence." Goal to detect most model problems by 2027. Calls for more funding and talent. Describes interpretability as a "test set" for alignment -- an independent check on whether alignment techniques are working. Supports third-party evaluation ecosystem.
Stanford HAI workshop: Third-party evaluators face limited model access, legal threats, lack of standardization, and resource constraints. Evaluation can show that a model is unsafe, but it cannot prove that a model is safe.
No Transluce-specific criticism found. All criticism is about the interpretability/evaluation approach generally. The org is too new and too small to have attracted targeted critiques.
What's Absent
- Named donors: Zero specific funders identified. Cannot assess independence.
- 990 financial data: Too new. EIN unknown.
- Sarah Schwettmann's views: No long-form interview or essay about her theory of change.
- Development period (July 2023 - October 2024): 15 months of pre-launch activity with no public information.
- Paid vs. free adoption of Docent: No clarity on whether the 25+ organizations are customers or free users.
- Steinhardt's explicit AI risk estimates: Despite his emphasis on forecasting, no public P(doom) or timeline.
- Independent evaluation of Transluce's tools: No external assessment of whether their tools improve safety outcomes vs. simpler alternatives.
- Conflict of interest policy: No disclosed policy for Steinhardt's dual role as CEO and Open Philanthropy technical advisor.
Recommended Reading
Neel Nanda, 80,000 Hours podcast -- "The race to read AI minds." The most candid assessment of interpretability's realistic limits and value, from the field's leading researcher. Essential context for any interpretability org. (80000hours.org)
Jacob Steinhardt, Generally Intelligent podcast (2021) -- Deep dive into Steinhardt's intellectual journey: from robustness to measurement to safety. Reveals core worldview. Predates Transluce but explains why it exists. (imbue.com)
"Against Almost Every Theory of Impact of Interpretability" by Charbel-Raphael -- The strongest written case against interpretability as a safety strategy. Challenges 20+ theories of impact. Nanda partially agreed. (LessWrong)
Steinhardt, "Complex Systems are Hard to Control" (2023) -- The worldview behind Transluce. Neural networks as complex adaptive systems requiring fundamentally different approaches than traditional engineering. (bounded-regret.ghost.io)
Dario Amodei, "The Urgency of Interpretability" (2025) -- The strongest case for interpretability from industry. Describes the race between interpretability and model intelligence. (darioamodei.com)