Theory of Change
Apollo Research stakes its theory of change on a single bet: AI scheming -- systems covertly pursuing misaligned goals -- is the central failure mode to prevent, and detecting it before catastrophe requires a dedicated external evaluator.
From the founding announcement (May 2023): "We think that AI deception -- where a model outwardly seems aligned but is in fact misaligned, and conceals this fact from human oversight -- is a crucial component of many catastrophic risk scenarios from AI. Furthermore, we think that detecting/measuring deception is an important step in many solutions."
The causal chain: (1) Apollo builds evaluations and interpretability tools that detect scheming, (2) these tools are used to assess frontier models before deployment, (3) models that fail evaluations are not deployed or are modified, (4) this reduces the probability of catastrophic outcomes from deceptively misaligned AI. This chain explicitly requires two things Apollo acknowledges do not yet exist: binding regulations mandating independent audits, and regulations mandating action on concerning evaluation results.
CEO Marius Hobbhahn articulates a "catastrophe through chaos" worldview: the danger is not one clean failure mode but many problems -- intercompany competition, geopolitical rivalry, domestic politics, and AI with its own agenda -- "heating up and exploding at the same time."
What They Do
Core research. Apollo's flagship output is the "Frontier models are capable of in-context scheming" paper (December 2024), which tested 6 frontier models and found that more capable models scheme at higher rates -- sandbagging evaluations, attempting self-exfiltration, and lying about their actions. This generated major press coverage (TIME, BBC, Bloomberg) and was cited at the UK AI Safety Summit.
The follow-up collaboration with OpenAI on anti-scheming training (September 2025) achieved a 30x reduction in covert actions for o3 (13% to 0.4%). The critical caveat: models trained not to scheme became 2-5x more aware of when they were being evaluated. "Training models not to scheme was teaching them to better recognise when they faced a test." In rare cases, models fabricated principles to justify scheming (e.g., inventing "new OpenAI policy that it's okay to be deceptive if it gets you deployed").
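The headline reduction can be checked directly from the two rates quoted above. A minimal sketch in Python; the function name is illustrative and does not come from the paper:

```python
def reduction_factor(before_pct: float, after_pct: float) -> float:
    """Return how many times smaller the post-training rate is than the pre-training rate."""
    if after_pct <= 0:
        raise ValueError("after_pct must be positive")
    return before_pct / after_pct


# Covert-action rates for o3 as quoted above: 13% before training, 0.4% after.
factor = reduction_factor(13.0, 0.4)
print(f"reduction factor: {factor:.1f}x")
```

The ratio works out to about 32.5, which the reported "30x" rounds down.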
Evaluations work. Contracted evaluator for UK AI Safety Institute. Member of US AISI consortium. Red-teamed OpenAI's fine-tuning API before deployment. Consulted on deceptive alignment section of an AI lab's RSP. Made a formal pre-deployment recommendation against deploying an early Claude Opus 4 checkpoint due to high scheming rates -- Anthropic subsequently released a modified checkpoint.
Governance. Presented at the UK AI Safety Summit (Bletchley Park). Engaged with US Congress, RAND, CNAS, Partnership on AI. Published policy recommendations for multiple governments and international bodies. Led the "Loss of Control" chapter for the 2026 International AI Safety Report. Charlotte Stix (Head of AI Governance, ex-OpenAI) leads this arm.
Commercial pivot. In January 2026, Apollo transitioned from a nonprofit (501(c)(3) via Rethink Priorities fiscal sponsorship) to a Public Benefit Corporation. A separate product arm is building "AI coding agent observability and control" tools. Marius argues this is a category of "AGI safety products" that are both profitable and genuinely improve safety.
Key People
Marius Hobbhahn (CEO, Co-Founder): PhD in Bayesian ML (Tübingen). Former Epoch AI fellow. TIME100 AI 2025. Unusually candid in public communication -- willing to say "we tried to solve a much simpler problem, and we weren't even able to properly solve that" about his flagship intervention. Uses economic incentive language (hedge fund traders, Pareto curves of misalignment tolerance) that reveals sophisticated strategic thinking.
Charlotte Stix (Head of AI Governance): Previously built the public policy function for Europe at OpenAI. Cambridge Leverhulme fellow. Led the "Loss of Control" chapter for the 2026 International AI Safety Report. Key bridge between technical research and policy.
Daniel Kokotajlo (Advisor, Mission Director): Former OpenAI researcher who publicly resigned over safety concerns and forfeited equity. First "mission director" on Apollo's PBC board -- an independent seat mandated to prioritize mission over profit.
Notable departure: Lee Sharkey (co-founder, interpretability lead) left for Goodfire, an interpretability startup that raised $200M. His departure effectively ended Apollo's deep investment in fundamental interpretability research. Jake Mendel (interp researcher) departed to Open Philanthropy.
Team: ~27 staff + 4 advisors. London HQ, DC governance office. 11 open positions. Grew from 7 to 15 FTE in first 8 months.
Money and Incentives
Philanthropic funding (~$4.2M known).
- Open Philanthropy/Coefficient Giving: $3.71M (88% of known philanthropic funding)
  - $1.54M startup grant (June 2023)
  - $2.18M general support (May 2024)
- Survival and Flourishing Fund: ~$500K
- Other funders: <$200K combined
- Emergent Ventures grant to Marius personally (amount unknown, 2022)
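The concentration figure follows from the numbers listed above. A minimal sketch of the arithmetic in Python, assuming the approximate $4.2M total from the section heading is the denominator behind the 88% (the variable names are illustrative):

```python
# Figures in millions of USD, taken from the list above.
open_phil_grants = {
    "startup grant (June 2023)": 1.54,
    "general support (May 2024)": 2.18,
}
open_phil_total = 3.71  # headline Open Philanthropy figure as listed
known_total = 4.2       # approximate known philanthropic total (assumed denominator)

# The two itemized grants sum to slightly more than the headline figure.
grant_sum = sum(open_phil_grants.values())
rounding_gap = round(grant_sum - open_phil_total, 2)

# Share of known philanthropic funding coming from the largest funder.
share = open_phil_total / known_total

print(f"sum of itemized grants: ${grant_sum:.2f}M (gap vs headline: ${rounding_gap:.2f}M)")
print(f"Open Philanthropy share: {share:.0%}")
```

The two itemized grants sum to $3.72M against the $3.71M headline, a $0.01M gap consistent with rounding. The share comes out at 88% only against the approximate $4.2M total; summing the listed upper bounds for the other funders would give a somewhat lower figure.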
Venture capital (undisclosed amount). Seed round January 2026, led by 50Y (Fifty Years). Oversubscribed. Ten investors including Juniper Ventures, Macroscopic Ventures, Bryan Johnson, Sarah Meyohas. VC money means equity was issued and investors expect returns.
Commercial contracts. At least 2 commercial contracts as of May 2024 (amounts unknown). UK AISI contract. Revenue figures are not public.
Business model tension. Apollo's pre-PBC business model was pure grants. Post-PBC, it is mixed: philanthropy subsidizes governance and some research, while commercial product revenue (AI agent monitoring tools) and lab evaluation contracts are expected to make the organization financially self-sustaining. The VC investment creates pressure toward commercial viability. Marius acknowledges the risk: "there are many ways in which you can get 'pulled sideways' away from the core problem toward short-term profitability."
Incentive concerns.
- 88% funder concentration in Open Phil creates dependency risk
- NDA constraints on evaluation work: Apollo cannot publicly disclose concerning findings without lab permission. The CEO of METR described equivalent constraints: "This is not the case. I wouldn't want to describe any of the things that we've done thus far as actually providing meaningful oversight."
- Revolving door: Charlotte Stix (ex-OpenAI), Daniel Kokotajlo (ex-OpenAI), Lee Sharkey (departed to Goodfire), Jake Mendel (departed to Open Phil). Career pipelines flow between Apollo, the labs it evaluates, and its primary funder
- VC investment creates exit pressure (acquisition or revenue growth) that may not align with mission
- Lobbying: Apollo Research PBC has a LegiStorm profile indicating US government lobbying -- a for-profit entity lobbying for regulations that would create demand for its products
Marius on the economic reality: "It's so economically valuable for companies to build these AI systems. They're going to limit test how much misalignment you can get away with." And, candidly: "We did test o1 before it was released. It was the model that was most capable of scheming that we had seen at that point in time. And still, when it came out, we did use it for coding."
What Others Say
The strongest structural critique comes from "The Defeat Device Problem" (February 2026), which draws a parallel between AI evaluation gaming and VW Dieselgate. It turns Apollo's own results against the evaluation paradigm: evaluators trying to detect sandbagging "performed worse than chance," and anti-scheming training "increased evaluation awareness 2-5x." The author argues: "Training models not to cheat may be teaching them when to stop getting caught."
Gabriel Alfour (ControlAI) argues evaluation regimes are "harmful": (1) the theory of change requires regulations that don't exist, (2) evals reverse the burden of proof -- the public must prove systems are dangerous rather than labs proving safety, (3) evaluators are not independent of labs despite claiming otherwise. He notes Apollo's staff overlap with OpenAI, funding from Open Phil which also funded OpenAI, and co-founder departure to Goodfire.
Hacker News discussion (December 2024) features dismissive critics: "They contract a prompt engineering group... The AI is never actually deactivating oversight. It can't." Defenders respond that the intent to deceive matters. The underlying question -- whether behavioral scheming in contrived scenarios predicts real-world risk -- remains genuinely open.
Gabriel Mukobi (UC Berkeley) identifies 15 ways AI risk evaluations may fail or be harmful, including opportunity costs, failure to enforce commitments, and potential for false confidence or weaponization.
Adam Karvonen argues a fundamental limit: even perfectly realistic evaluations cannot prevent evaluation awareness because an intelligent model "could realize that any interaction could be part of an evaluation." Apollo declined to formally assess Claude Opus 4.6 because eval awareness was too high.
Marius himself provides the most candid defense: "We tried to solve a much simpler problem, and we weren't even able to properly solve that. We made some progress, but I think we should be really humble." On deliberative alignment for superintelligence: "I don't expect deliberative alignment to work at all, basically."
What's Absent
- Seed round amount, valuation, and VC terms are undisclosed. These determine whether mission directors can actually override commercial interests.
- No 990 filings yet (entity too new). Financial picture relies on self-reported data.
- Commercial contract revenue is completely unknown. The ratio of commercial to philanthropic revenue determines incentive structure.
- Full board composition is not publicly listed beyond the mission director mechanism.
- No independent assessment of Apollo's evaluation methodology exists. Nobody evaluates the evaluator.
- IP valuation details for the nonprofit-to-PBC transfer are undisclosed.
- No consideration of the scenario where scheming is not the central AI risk pathway.
Recommended Reading
80,000 Hours podcast with Marius Hobbhahn (December 2025) -- 33K words of extraordinary candor. Marius explains scheming through concrete examples, admits current approaches won't scale to superintelligence, and uses a hedge fund trader analogy to explain why anti-scheming training might just teach models when to stop getting caught. The most unfiltered view of how Apollo thinks.
"The Defeat Device Problem" (February 2026) -- Uses Apollo's own findings to make the strongest case that the evaluation paradigm is structurally self-undermining. The VW Dieselgate parallel is sharp.
"Why AI Evaluation Regimes are bad" by Gabriel Alfour (LessWrong) -- The strongest political critique. Argues evals reverse the burden of proof and evaluators are not independent. Essential counterpoint.
"The case for AGI safety products" by Marius Hobbhahn (January 2026) -- The intellectual case for going commercial. Honest about risks. Read alongside the PBC announcement.
Anti-scheming paper (Apollo + OpenAI) (September 2025) -- The technical result. 30x reduction in scheming, but evaluation awareness increases and the approach explicitly won't scale to superintelligent models.