Transluce

Research

Steinhardt. Transparency/auditing tools.

Founded: 2023
HQ: San Francisco, CA
Team: 20
Structure: 501(c)(3) nonprofit
Model: Mixed

Theory of Change

Transluce's theory of change is that AI safety requires independent, third-party oversight -- companies cannot be "the primary arbiters of their safety, due to the conflict of interest with commercial priorities." Transluce builds open, scalable tools for auditing AI systems and makes them available to external evaluators, researchers, and governments.

The causal chain: (1) Build automated interpretability tools that use AI to understand AI -- specialized smaller models can audit larger ones. (2) Open-source these tools so anyone can run audits, not just the labs. (3) Conduct public audits that demonstrate model problems and create pressure for improvement. (4) Build institutional infrastructure (AI Evaluator Forum, AEF-1 standard) so third-party evaluation becomes standard practice. (5) The resulting ecosystem of independent evaluators provides a check on labs that internal safety teams cannot.

Jacob Steinhardt's framing: AI systems are complex adaptive systems, comparable to ecosystems or financial markets, where "straightforward attempts to control their behavior lead to unintended consequences." Safety requires diverse, overlapping approaches -- measurement infrastructure built now enables action later, analogous to how climate measurement laid the groundwork for climate policy.

What They Do

Public auditing: Transluce's most visible work is investigating and publicizing model problems. Their o3 truthfulness investigation (April 2025) found that OpenAI's reasoning model "frequently fabricates actions it took to fulfill user requests, and elaborately justifies the fabrications when confronted." O-series models hallucinate at 2-4x the rate of GPT-series. This got TechCrunch coverage and hypothesized causes (outcome-based RL, discarded chains of thought) that informed public understanding. Separately, their PRBO method found self-harm encouragement, conspiracy theories, and other pathological behaviors in Qwen, Llama, and DeepSeek models.

Docent platform: Adopted by 25+ organizations including Anthropic (used for Claude 4 pre-deployment safety analysis), DeepMind, METR, Redwood, Apollo, Palisade, Princeton, and Bridgewater. Enables analysis of AI agent transcripts -- finds corrupted tasks, unintended solutions (e.g., an agent reading a flag from a Dockerfile instead of solving a challenge), and unexpected behaviors. This is probably their most impactful product to date.

Investigator agents: Trained models that automatically elicit behaviors from target systems. An 8B-parameter model jailbreaks GPT-5 (28% on thinking mode), Claude Opus 4.1, and Gemini 2.5 Pro. 98.1% string elicitation rate. Achieves 65.8% transfer rate to GPT-4o from a Llama-8B investigator.

Automated neuron descriptions: Described all 458,752 MLP neurons in Llama-3.1-8B at $0.046/neuron, outperforming human quality. 12,000+ downloads.

Predictive Concept Decoders (PCDs): Encoder-decoder architecture for interpretability, outperforming LatentQA for probing model representations.

Monitor: Open-source observability interface for exploring neural network internals with activation/attribution modes.

Field-building: Co-founded AI Evaluator Forum (December 2025) at NeurIPS with METR, RAND, SecureBio, Princeton HAL, CIP, Meridian Labs, and AVERI. Released AEF-1 standard for third-party evaluation independence. MATS mentoring (Summer 2026).

Output trajectory since October 2024 launch: roughly one major publication or product per month. 10 open-source repositories on GitHub.

Key People

Jacob Steinhardt (CEO, co-founder): On leave from UC Berkeley (Statistics/EECS). PhD Stanford under Percy Liang. PostDoc at OpenAI. Co-author "Concrete Problems in AI Safety" (with Dario Amodei, Chris Olah, Paul Christiano). Designed MMLU benchmark. Technical advisor to Open Philanthropy. Co-founded SPARC (2012) with Paul Christiano, Andrew Critch, Anna Salamon. AI2050 Fellow (Schmidt Sciences). Has been working on AI safety since 2011. Describes himself as a "worried optimist." Deep intellectual investment in measurement, forecasting, and the complex systems perspective on AI safety.

Sarah Schwettmann (Chief Scientist, co-founder): Cognitive neuroscientist. PhD MIT (Brain and Cognitive Sciences, Tenenbaum/Torralba). Built MAIA -- the first large-scale automated interpretability agent pipeline at MIT CSAIL. Her neuroscience-to-interpretability pathway is distinctive. No long-form public interview about Transluce's vision found.

Conrad Stosz (Head of Governance): Former director of AI at White House OMB. Former Acting Director of U.S. Center for AI Standards and Innovation. Joined October 2025. Signals serious policy engagement.

Team: ~20 people. Founding team includes Neil Chowdhury (ex-OpenAI safety, GPT-4o/o1) and Kevin Meng (MIT PhD, created ROME/MEMIT model editing).

Money and Incentives

Budget: $11M fundraising target for the current giving season. No total historical budget disclosed.

Revenue split: ~80% philanthropic, ~20% earned revenue from private companies and governments. Projects earned revenue to increase over time.

Revenue model: Open core. Core oversight tools are open source; hosted services and advanced features generate revenue.

Budget allocation: 60% scaling existing research, 15% evaluating model releases, 10% governance and public accountability, 10% new research, 5% overhead.

Named donors: None identified. This is the biggest financial information gap. We cannot assess donor concentration, lab dependencies, or potential conflicts of interest without knowing who funds Transluce.

Pre-Transluce grants to Steinhardt: $156K from Coefficient Giving/Open Philanthropy (2018-2022) -- all to Steinhardt personally or UC Berkeley, not to Transluce as an entity. Additional grants through Schmidt Sciences (AI2050 Fellowship) and other academic sources.

No 990 data: Organization is too new for IRS filings to be publicly available. EIN unknown.

Structural comparison with Goodfire: Transluce operates on roughly 1/20th of Goodfire's resources (~$11M philanthropic vs. $207M VC). No equity investors, no valuation pressure, no board members seeking returns. But also: no ability to pay $200-400K salaries, no massive compute budget, and dependency on a small number of likely philanthropic funders.

Incentive analysis: The nonprofit structure protects against the commercial pressure that threatens for-profit interpretability companies (Goodfire's $1.25B valuation demands revenue growth). But the 80/20 philanthropic/earned split creates its own dependency. If a small number of funders (potentially including AI labs that Transluce audits) provide most of the 80% philanthropic portion, independence could be compromised without any visible pressure.

Steinhardt's role as Open Philanthropy technical advisor is a potential conflict of interest -- he advises the largest funder in AI safety while leading an organization that likely depends on similar funding sources. No public recusal policy exists.

What Others Say

Neel Nanda (Google DeepMind, mechanistic interpretability lead), in a 30K-word 80,000 Hours interview: "My most ambitious vision of mech interp is probably dead." Full model understanding is not achievable. But: "I've gone from there's a low chance this is an incredibly big deal to there's a high chance this is a medium big deal." Advocates "swiss cheese model" of safety -- multiple imperfect layers. Simple probes work surprisingly well for detecting harmful intent. Notes that chain-of-thought monitoring is "kind of wild" in how useful it is, but estimates it will be useful for "maybe one to three years, but not indefinitely."

This assessment is important for Transluce because: (a) Transluce's practical tool-building approach aligns better with Nanda's revised pragmatic vision than with the original ambitious mech interp vision; (b) Docent is essentially CoT monitoring for agents, and Nanda's timeline gives it 1-3 years of relevance; (c) Nanda's endorsement of probes supports Transluce's PCD and decoder-based approaches.

Charbel-Raphael (comprehensive 11K-word critique): Argues against nearly every theory of impact for interpretability. "Auditing deception with interp is out of reach." Enumerative safety is "doomed from the start." Interpretability is primarily post-hoc, not predictive. Better to invest in other approaches. Neel Nanda partially agreed with this critique.

This matters because most of these critiques target the "fully understand the model" vision. Transluce's approach -- build practical auditing tools, find model problems, publish them -- is a narrower claim that partially sidesteps these critiques.

Dario Amodei (Anthropic CEO): "We are in a race between interpretability and model intelligence." Goal to detect most model problems by 2027. Calls for more funding and talent. Describes interpretability as a "test set" for alignment -- an independent check on whether alignment techniques are working. Supports third-party evaluation ecosystem.

Stanford HAI workshop: Third-party evaluators face limited model access, legal threats, lack of standardization, and resource constraints. The evaluation approach cannot prove a model is safe, only indicate it is unsafe.

No Transluce-specific criticism found. All criticism is about the interpretability/evaluation approach generally. The org is too new and too small to have attracted targeted critiques.

What's Absent

Named donors: Zero specific funders identified. Cannot assess independence.
990 financial data: Too new. EIN unknown.
Sarah Schwettmann's views: No long-form interview or essay about her theory of change.
Development period (July 2023 - October 2024): 15 months of pre-launch activity with no public information.
Paid vs. free adoption of Docent: No clarity on whether the 25+ organizations are customers or free users.
Steinhardt's explicit AI risk estimates: Despite his emphasis on forecasting, no public P(doom) or timeline.
Independent evaluation of Transluce's tools: No external assessment of whether their tools improve safety outcomes vs. simpler alternatives.
Conflict of interest policy: No disclosed policy for Steinhardt's dual role as CEO and Open Philanthropy technical advisor.

Stated Theory of Change

Transluce claims that AI safety requires independent, third-party oversight because companies face structural conflicts of interest. The causal chain:

Build automated interpretability tools that use AI to audit AI (scalability pillar)
Make these tools open-source so anyone can run audits (openness pillar)
Conduct public audits that surface model problems and generate pressure for improvement
Build institutional infrastructure (AI Evaluator Forum, AEF-1 standard) for third-party evaluation
The resulting ecosystem of independent evaluators provides checks that internal safety teams cannot

Steinhardt frames this through a complex systems lens: neural networks are complex adaptive systems where "straightforward attempts to control their behavior lead to unintended consequences." Safety requires diverse, overlapping approaches -- a "swiss cheese model" where multiple imperfect layers provide aggregate protection. Measurement infrastructure built now enables action later, analogous to how climate measurement enabled climate policy.

The theory is grounded in a specific epistemological position: neither pure empiricism (Engineering approach) nor pure theory (Philosophy approach) is sufficient for AI safety. Both are needed, and well-chosen experiments on current systems can tell us a lot about future ones.

Revealed Theory of Change

Transluce's actions are remarkably consistent with its stated theory. This is unusual.

Research output matches stated priorities. Their publications cluster around: (a) automated model auditing (o3 truthfulness, pathological behaviors, jailbreaking), (b) interpretability infrastructure (neuron descriptions, Monitor, PCDs), and (c) evaluation tools (Docent, investigator agents). No significant drift toward capabilities research or commercial products disconnected from the safety mission.

Open-source commitment is real. 10 GitHub repositories, 12,000+ downloads of neuron description tools, Monitor is publicly available. The open core model preserves both accessibility and sustainability.

Institutional building matches theory. Co-founding AEF with METR, RAND, and others is exactly what you'd do if you believed the third-party evaluation ecosystem was undersupplied. AEF-1 standard codifies independence requirements.

Where stated and revealed diverge slightly:

The earned revenue (~20%) from "private companies and governments" introduces ambiguity about whether Docent users are paying customers or free adopters. If labs like Anthropic and DeepMind are paying, Transluce has a financial interest in maintaining good relationships with the organizations it audits. This is manageable but worth monitoring.
Steinhardt's role as Open Philanthropy technical advisor creates a relationship with the dominant funder in AI safety that complicates claims of independence.
The team composition is heavily technical. The governance hires (Stosz, Reich) are recent additions. For an organization whose theory of change requires policy and institutional change, not just technical tools, governance has been under-resourced relative to research.

Key Assumptions

1. Third-party evaluation can work without mandatory model access.

Evidence for: Transluce has conducted audits using pre-release access (o3 from OpenAI), open models (Llama, DeepSeek, Qwen), and tools that work on model outputs. AEF-1 standard encourages voluntary access.
Evidence against: Stanford HAI workshop found limited access is a primary barrier. Auditors face "a lack of protection against retaliation." Labs can simply decline access. Voluntary standards have limited enforcement power.
Testable: Does Transluce sustain pre-release access from multiple labs? Do AEF-1 signatories actually follow through?
If wrong: Transluce becomes limited to auditing open-source models -- a real but much narrower mandate than their vision.

2. Automated interpretability tools can scale faster than model complexity.

Evidence for: Automated neuron descriptions outperform human quality at $0.046/neuron. Investigator agents discover jailbreaks automatically. Specialized smaller models can audit larger ones (AI2050 thesis).
Evidence against: Models are growing faster than interpretability tools. The gap between what we can audit and what frontier models can do is widening. Nanda: "the ambitious vision just hasn't worked out very well."
Testable: Can Transluce's tools detect a novel failure mode in a frontier model that simpler behavioral evaluations miss?
If wrong: Transluce's tools provide a false sense of security -- they catch known problems but miss novel ones. This is the most dangerous failure mode.

3. Public audits create sufficient pressure on labs to change behavior.

Evidence for: The o3 truthfulness finding got TechCrunch coverage and appears to have influenced OpenAI's own reporting on hallucination rates.
Evidence against: Labs can ignore public criticism, especially from small nonprofits. The power asymmetry between a ~20-person nonprofit and OpenAI ($157B valuation) is enormous.
Testable: Track whether labs change behavior in response to Transluce's specific findings.
If wrong: Transluce produces accurate research that nobody acts on.

4. A nonprofit can attract and retain top ML talent in competition with labs and well-funded startups.

Evidence for: Current team includes ex-OpenAI (Chowdhury), Berkeley PhD students on leave (Laidlaw), MIT PhDs (Meng, Schwettmann). Steinhardt's academic reputation attracts researchers.
Evidence against: Goodfire pays $200-400K. Anthropic and DeepMind pay more. Transluce's nonprofit structure limits compensation. $11M budget supports ~20 people; scaling significantly would require proportional fundraising growth.
Testable: Does attrition increase as team members receive outside offers?
If wrong: Transluce becomes a training ground that loses its best researchers to better-resourced organizations.

Strengths

Intellectual depth and consistency. Steinhardt has been thinking about AI safety since 2011, before it was fashionable. His intellectual trajectory -- robustness, measurement, complex systems, scalable oversight -- is coherent and deeply considered. The "Concrete Problems in AI Safety" paper (2016) helped define the field. The complex systems perspective is a genuinely distinctive intellectual contribution that avoids both Engineering myopia and Philosophy abstraction. This is not a startup pivoting into safety for commercial reasons.

Theory of change aligns with the field's revised consensus. Neel Nanda, the field's most prominent mech interp researcher, has shifted from "interpretability might solve alignment" to "interpretability is one of multiple imperfect tools." Transluce's approach -- build practical tools, conduct public audits, create institutional infrastructure -- is exactly the pragmatic version that Nanda now advocates. Transluce was already pursuing this before Nanda's public update, which suggests they understood the realistic value proposition early.

Concrete, verifiable impact. Docent used for Claude 4 pre-deployment analysis (25+ organizations). O3 truthfulness finding published with data and covered by major media. AEF co-founded with RAND, METR, and others. These are not just papers -- they're institutional facts.

Nonprofit structure protects mission. No VC pressure for revenue growth. No equity valuation that demands returns. The open core model preserves accessibility while generating revenue. 5% overhead is remarkably lean. The structural incentives point toward mission fulfillment, not profit maximization.

Deep ecosystem connections. Steinhardt knows Christiano, Amodei, Olah, and many other AI safety principals personally, going back to 2012 SPARC and 2016 "Concrete Problems." These relationships provide access and influence that a startup without such connections would take years to build.

Weaknesses and Risks

Financial opacity is the single biggest concern. Zero named donors. No 990 data. EIN unknown. We cannot assess: (a) donor concentration -- if 50%+ comes from one source, that source has veto power; (b) lab dependencies -- if Anthropic or DeepMind funds Transluce, auditing independence is compromised; (c) funding sustainability -- $11M is modest and may need to grow with the team. This opacity is not inherently suspicious for a young nonprofit, but it is material for evaluating independence claims.

Governance is minimal. A three-person board where the CEO is a member. One non-founder board member has venture capital ties (McCormick, GreatPoint Ventures). No external safety advisory board with governance authority. No public conflict of interest policy. For an organization whose entire value proposition is independence, the governance structure should be exemplary, not minimal.

The "watchdog" theory of change has structural limits. Third-party evaluators face structural power asymmetries with the organizations they audit: (a) they depend on voluntary model access; (b) they depend on funding from the same ecosystem; (c) their findings can be ignored without consequence; (d) they cannot prevent deployment even if they find problems. AEF-1 addresses (a) partially through voluntary standards, but none of these are solved by technical tools alone.

Investigator agents are a serious dual-use capability. An 8B model that jailbreaks GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro is both a safety tool and an offensive capability. The research paper acknowledges this but Transluce has no public responsible disclosure policy or dual-use governance framework. If these tools proliferate, they could be used more for attacks than for auditing.

Scale mismatch. ~20 people and ~$11M budget vs. frontier labs with thousands of employees and billions in revenue. Transluce can identify problems but lacks the resources to systematically audit even one frontier model, let alone all of them. The theory of change requires many organizations using Transluce's tools -- but if only Transluce itself does the auditing, coverage will be thin.

Cross-References

Goodfire (commercial interpretability, $207M VC): Same field, opposite structure. Goodfire pursues "intentional design" (training-time intervention) with VC money and commercial customers. Transluce pursues independent auditing with philanthropic money and open tools. Key comparisons:

Goodfire's RLFR result (58% hallucination reduction) is a training-time intervention; Transluce's o3 finding is a post-deployment audit. Different points in the pipeline.
Goodfire's "Most Forbidden Technique" concern (training against interpretability degrades interpretability) does not apply to Transluce's approach, which treats interpretability as a test set, not a training signal. This is consistent with Amodei's recommendation.
Goodfire has 10x the funding and 2.5x the team size but faces commercial pressures that Transluce avoids. Transluce cannot match Goodfire's compute or compensation, but faces fewer mission-drift risks.
Both organizations rely on the assumption that interpretability tools scale with model complexity. Both face the same fundamental challenge identified by Nanda: the ambitious vision of full understanding has failed.
The field needs both: Goodfire-type tools for model developers to build better systems, and Transluce-type tools for external evaluators to audit them. They are complementary, not competitive.

METR (model evaluations, ARC Evals spin-off): METR does behavioral evaluations (evals); Transluce does interpretability-based auditing. Both are third-party evaluators. They co-founded AEF together. Complementary approaches.

Anthropic (frontier lab, in-house interpretability): Transluce's closest relationship and most complex dependency. Steinhardt co-authored "Concrete Problems" with Amodei and Olah. Docent used for Claude 4 pre-deployment analysis. Amodei publicly endorses interpretability urgency. But if Anthropic is also a funder (unknown), independence is compromised. If Anthropic builds in-house tools that supersede Docent, Transluce's adoption story weakens.

What Would Change This Assessment

Upward:

Named donor disclosure showing diversified funding with no single source >25%
Evidence that a Transluce audit directly caused a lab to change a deployment decision
Independent replication showing Transluce's tools detect problems that behavioral evals miss
Growth to 50+ people with diversified revenue (30%+ earned)
Board expansion to 5-7 members with majority independent
Schwettmann doing a long-form interview articulating her distinctive perspective

Downward:

Donor disclosure showing >50% from one AI lab
Key researchers departing for better-funded organizations
Labs withdrawing pre-release access after unfavorable audits
Transluce's tools shown to be no better than simpler behavioral evaluations
Revenue from audited entities growing to where financial dependence is visible
AEF standards not adopted or enforced

Self-Critique

Potential bias toward optimism about Transluce. I may be overweighting the consistency of Steinhardt's intellectual trajectory and the structural advantages of the nonprofit model. The absence of criticism could be because no one has bothered to critique a small, young org -- not because there is nothing to criticize.

Missing information that would change my view: Named donors. 990 data. Whether Docent adoption is paid or free. Whether labs have denied Transluce model access. Any of these could significantly shift the assessment.

Weakest claim: That public audits create sufficient pressure on labs to change behavior. The o3 case is one data point. It is entirely possible that labs treat Transluce's findings as PR irritants to be managed rather than substantive inputs to deployment decisions.

What a thoughtful disagreer would say: "You're being too generous to the nonprofit model. Nonprofits in the AI safety space have their own pathologies: dependence on a small number of wealthy donors, revolving door with the organizations they oversee, reluctance to publicly criticize funders or close associates. Transluce's ecosystem connections (Steinhardt advises Open Phil, co-authored with Amodei, etc.) are not just strengths -- they're the same kind of institutional capture risk you'd criticize in a for-profit. And the financial opacity you flagged as a concern could be hiding exactly this kind of dependency."

Sources I should have checked: Whether Transluce has received funding from Anthropic, DeepMind, or other labs (not findable without 990s). Whether there are internal governance documents (conflict of interest policies, etc.) that exist but aren't public. Whether any lab has refused Transluce model access.

Connected to (16)

MATS Programcollaborator · Jacob Steinhardt

Apollo Researchcollaborator

AVERIcollaborator

Collective Intelligence Projectcollaborator

Google DeepMindcollaborator

LawZeroadvisor at · Jacob Steinhardt

METRcollaborator

Palisade Researchcollaborator

Princeton HALcollaborator

RAND Corporationcollaborator

Redwood Researchcollaborator

SecureBiocollaborator

Anthropiccollaborator · Jacob Steinhardt

OpenAIstaff from · Neil Chowdhury

Open Philanthropyadvisor at · Jacob Steinhardt

OpenAIstaff from · Jacob Steinhardt

Sources (33)

Every URL that was read during research.

1.Open & scalable technology for understanding AI systemstransluce.org
2.Our Purposetransluce.org
3.Aboutstatistics.berkeley.edu
4.Unknownjsteinhardt.stat.berkeley.edu
5.Jacob Steinhardt, UC Berkeley: On machine learning safety, alignment and measurementimbue.com
6.MIT researchers advance automated interpretability in AI modelsnews.mit.edu
7.AI Researcher Jacob Steinhardt - Future of Life Institutefutureoflife.org
8.Steering AI for the good of humanitycdss.berkeley.edu
9.Launch Announcement: Leading AI Research Organizations Create Forum of Independent Evaluatorsaievaluatorforum.org
10.More Is Different for AIbounded-regret.ghost.io
11.Bounded Regretbounded-regret.ghost.io
12.Jacob Steinhardthertzfoundation.org
13.The Misguided Quest for Mechanistic AI Interpretabilityai-frontiers.org
14.Sarah Schwettmann, Jacob Steinhardt at MATS: Summer 2026matsprogram.org
15.The emperor has no clothes?docs.google.com
16.Publicationsnchowdhury.com
17.LawZero | Jacob Steinhardtlawzero.org
18.OpenAI's new reasoning AI models hallucinate more | TechCrunchtechcrunch.com
19.AI Forecasting: Two Years Inbounded-regret.ghost.io
20.Jacob Steinhardt - AI2050ai2050.schmidtsciences.org
21.Translucegithub.com
22.Many safety evaluations for AI models have significant limitations | TechCrunchtechcrunch.com
23.Discover What Every Neuron in the Llama Model Does | Towards Data Sciencetowardsdatascience.com
24.Dario Amodei — The Urgency of Interpretabilitydarioamodei.com
25.Transluce - Organizationidealist.org
26.Cassidy Laidlawcassidylaidlaw.com
27.2023 Rising Star: Conrad Stosznextgov.com
28.Sarah Schwettmanncogconfluence.com
29.Strengthening AI Accountability Through Better Third Party Evaluationshai.stanford.edu
30.Neel Nanda on the race to read AI minds (part 1) | 80,000 Hours80000hours.org
31.Complex Systems are Hard to Controlbounded-regret.ghost.io
32.Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systemsjsteinhardt.stat.berkeley.edu
33.Community Perspective - Jacob Steinhardt - AI2050ai2050.schmidtsciences.org