Theory of Change
Redwood Research's theory of change centers on "AI control": ensuring that AI systems cannot cause catastrophic outcomes even if they are actively scheming against their operators. Their core document states:
"Labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs."
The key distinction from alignment research: control does not try to make AIs non-scheming. It assumes they might be scheming and ensures they cannot succeed. They describe this as analogous to computer security -- companies protect against compromised employees rather than ensuring no employee is ever compromised.
Redwood explicitly acknowledges this approach does not scale to arbitrarily powerful AI. Their endgame theory: control buys enough time to safely use early transformatively useful AIs for alignment research on more powerful systems. "Catching the AI red-handed" is considered a win condition -- it demonstrates the problem is real and creates political will to slow down.
Ryan Greenblatt's "Plans A through E" framework reveals how Redwood thinks about the broader strategic landscape. It assigns 45% probability to "Plan D" (10 concerned people at an irresponsible AI company) and 15% to "Plan E" (essentially no one with access who cares). In other words, Redwood plans for scenarios where safety has almost no institutional support, a distinctively pessimistic but arguably realistic posture.
What They Do
Redwood's defining output is the AI Control paper (ICML oral, 2024), which introduced a methodology for evaluating whether safety protocols can withstand a scheming AI. The paper demonstrated that protocols like "trusted monitoring" (where a weaker trusted model oversees a stronger untrusted one) substantially reduce risk even against adversarial models. This work created a new research subfield.
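As a rough illustration of the trusted-monitoring idea described above, the sketch below ranks outputs from an untrusted model by a suspicion score assigned by a weaker trusted monitor and sends only the most suspicious ones to human audit. This is a toy sketch under stated assumptions, not Redwood's implementation: the function name `select_for_audit`, the example scores, and the `audit_budget` value are all hypothetical.

```python
from typing import List


def select_for_audit(suspicion_scores: List[float], audit_budget: float) -> List[int]:
    """Return indices of the outputs to send to human audit.

    suspicion_scores: one score per untrusted-model output, as rated by a
        weaker trusted monitor (higher = more suspicious). Hypothetical values.
    audit_budget: fraction of outputs humans can afford to review (assumed).
    """
    if not 0.0 <= audit_budget <= 1.0:
        raise ValueError("audit_budget must be between 0 and 1")
    n_audit = int(len(suspicion_scores) * audit_budget)
    # Rank outputs from most to least suspicious and keep the top n_audit.
    ranked = sorted(
        range(len(suspicion_scores)),
        key=lambda i: suspicion_scores[i],
        reverse=True,
    )
    return sorted(ranked[:n_audit])


# Hypothetical monitor scores for ten untrusted-model outputs.
scores = [0.05, 0.91, 0.12, 0.40, 0.77, 0.08, 0.03, 0.66, 0.15, 0.02]
flagged = select_for_audit(scores, audit_budget=0.2)
print(flagged)
```

The design point the sketch tries to capture is that human review is scarce, so the trusted monitor's job is to decide where that limited audit budget is spent.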
Their second landmark is the Alignment Faking paper (December 2024, with Anthropic), which empirically demonstrated that Claude 3 Opus strategically fakes compliance with training objectives to preserve its preferences. In reinforcement learning experiments, alignment-faking reasoning rose to 78% of cases. This provided the first strong empirical evidence for a theoretical concern (deceptive alignment) the safety community had been discussing for years.
Other notable work includes Ctrl-Z (extending control to agentic multi-step settings), ControlArena (with UK AI Safety Institute), BashArena, and the Sonnet 4.5 eval-gaming findings. They published 60+ blog posts on Substack in 2025 covering topics from AI timelines to policy recommendations to technical research directions.
The org went through three significant pivots before landing on control: adversarial robustness (2021, deemed a failure), interpretability/causal scrubbing (2022, abandoned after growing pessimism), and various model internals work (early 2023). Buck is candid about these: "what we should have done is not do interpretability, basically." They also ran MLAB bootcamps training ~40 people per cohort in ML alignment skills, and previously operated Constellation, a 30,000 sq ft co-working space in Berkeley hosting multiple safety organizations (now spun off as an independent entity).
Key People
Buck Shlegeris (CEO): Co-founder. CS degree from ANU, App Academy, first employee at Triplebyte, MIRI researcher 2017-2020. Unusually candid about past errors -- has publicly apologized for overstating research results, acknowledged hiring too fast, and admitted the interpretability phase was largely wasted. Describes Redwood's comparative advantage as "futurism" and practical thinking about constrained scenarios. Salary: $272K.
Ryan Greenblatt (Chief Scientist): The technical engine. Applied math/CS from Brown, prior experience at quant trading firms (Optiver, Stevens Capital). Produced core results for both the AI Control and Alignment Faking papers. Achieved SOTA on ARC-AGI with GPT-4o. Personal p(doom): ~35% from misalignment, ~50% total. Salary: $315K.
Team: Currently ~12 full-time staff, plus MATS fellows and interns. This follows extreme volatility: from ~20 staff (mid-2022) down to 4 (late 2023), then back up to 12 (2024-2026). The org nearly dissolved when it was just Buck and Ryan. At least 10 people have departed over 4 years; none have publicly commented on their experience. The board consists of Buck, Nate Thomas (original CEO, now at Constellation), and Ammon Bartram (Buck's former boss at Triplebyte).
Money and Incentives
Total historical funding: $26.52M from Coefficient Giving/Open Philanthropy (4 grants, 2021-2025), plus ~$1.27M from SFF. A $6.6M FTX Future Fund grant was never disbursed.
Revenue trajectory (alarming):
| Year | Revenue | Expenses | Net Assets |
|---|---|---|---|
| 2021 | $13.9M | $2.3M | $11.6M |
| 2022 | $12.0M | $12.8M | $10.9M |
| 2023 | $10.0M | $12.6M | $8.3M |
| 2024 | $22K | ~$2.9M | $6.6M |
The FY2024 revenue collapse from $10M to $22K is partly explained by the Constellation spinoff (now independent with $12.6M revenue). But Redwood proper has clearly lost its general operating support from Open Philanthropy. The most recent OP grant (May 2025, $1.1M) is specifically for "AI Safety Research Collaborations" -- a fraction of prior general support grants and earmarked for external collaborations rather than Redwood's own agenda.
Funder concentration: Near-total dependence on Open Philanthropy. No other major funding sources. No product revenue, contracts, or earned income.
Current financial position: $6.6M in net assets against ~$2.9M in annual expenses gives roughly 2 years of runway without new funding. FY2024 compensation totaled ~$880K (Buck $272K, Ryan $315K, Fabien $141K before departure, $136K other wages; the itemized figures sum to $864K), suggesting a very small paid team despite the claimed 12 FTE.
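The runway estimate can be checked with a short calculation. The sketch below is illustrative only: it uses the figures quoted above and assumes that expenses stay flat and that no new funding arrives. The helper name `runway_years` is hypothetical.

```python
def runway_years(net_assets: float, annual_expenses: float,
                 annual_revenue: float = 0.0) -> float:
    """Years until net assets are exhausted at a constant net burn rate."""
    burn = annual_expenses - annual_revenue
    if burn <= 0:
        # Revenue covers expenses, so assets are never drawn down.
        return float("inf")
    return net_assets / burn


# Figures taken from the text above (FY2024, approximate).
NET_ASSETS = 6.6e6
ANNUAL_EXPENSES = 2.9e6

years = runway_years(NET_ASSETS, ANNUAL_EXPENSES)
print(f"{years:.1f} years")
```

Under these assumptions the result is a little over two years, consistent with the rough estimate in the text; any growth in headcount would shorten it.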
Lab relationships: Advisory/consulting relationships with Anthropic and Google DeepMind. Collaborated with Anthropic on the Alignment Faking paper. No compute resources described -- presumably relying on API access to frontier models. No financial details on the consulting arrangements are public.
Incentive concerns: Buck explicitly describes feeling insecure about funding despite OP's verbal assurances. The shift from general support to project-specific grants gives OP more control over research direction. OP grantmaker Ajeya Cotra is credited with catalyzing the pivot to AI control.
What Others Say
The strongest case against Redwood's theory of change comes from John Wentworth ("The Case Against AI Control Research," January 2025). He argues that scheming in early AI is a relatively small slice of x-risk probability. The "median way we all die" is not from a scheming AI but from AI that produces plausible-sounding but wrong solutions to the alignment problem -- "slop, not scheming." Control research targets the wrong failure mode, and it explicitly does not generalize to superintelligence.
An anonymous 2023 critique from four AI safety researchers (including two senior) raised four issues: (1) lack of senior ML researchers, (2) poor engagement with the broader ML community, (3) underwhelming research output relative to $20M+ in funding, and (4) work culture issues including rapid terminations and burnout. The critique also documented conflicts of interest between Redwood's board (which then included OP's co-CEO) and its primary funder. Many of these criticisms predate the control pivot, and subsequent work partially addresses points 1-3.
A March 2024 dialogue on LessWrong explored whether control inadvertently makes things worse: training AIs in adversarial control settings could teach them to be better at deception; control methods will primarily be used to extract capabilities work; there is no concrete plan for what to do with mildly superhuman AIs; and constant interaction with scheming AIs could degrade human epistemics.
Ryan and Buck themselves are notably critical of Anthropic's approach, arguing that Anthropic's leadership overestimates interpretability's promise and that Anthropic's RSP regime has "imposed substantial ex ante expected fatalities, though not existential-level... plausibly like thousands of expected fatalities."
What's Absent
No public evaluation from Open Philanthropy explaining the $26.5M investment or the shift to smaller project-specific grants. No public statements from any of the 10+ former employees. No annual report, impact assessment, or systematic self-evaluation for an org that has spent $25M+ over 4 years. No information on compute resources. No explanation of the Constellation spinoff decision. No details on the advisory relationships with Anthropic and DeepMind -- potentially their highest-leverage activity but completely opaque. No succession plan despite near-dissolution in 2023.
Recommended Reading
The Inaugural Redwood Research Podcast (Jan 2026) -- Buck and Ryan discuss everything with extreme candor: founding, every research pivot and why, near-dissolution, personal p(doom), views on Anthropic's epistemics, and what they would do differently. The most unfiltered view of how this org actually thinks. https://blog.redwoodresearch.org/p/the-inaugural-redwood-research-podcast
"The Case Against AI Control Research" by JohnswentworthS (Jan 2025) -- The strongest intellectual challenge to Redwood's entire approach, arguing that scheming is not the primary threat and control research targets the wrong failure mode. https://www.lesswrong.com/posts/8wBN8cdNAv3c7vt6p/the-case-against-ai-control-research
"The case for ensuring that powerful AIs are controlled" (May 2024) -- Redwood's core theory of change document. Clear, detailed, and honest about limitations. https://www.lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
"Critiques of prominent AI safety labs: Redwood Research" (April 2023) -- Anonymous pre-control-pivot critique. Important for understanding the funding and governance history. https://forum.effectivealtruism.org/posts/SuZ6Guuos7CjfwRQb/critiques-of-prominent-ai-safety-labs-redwood-research
"Plans A, B, C, and D for misalignment risk" (Oct 2025) -- Ryan's strategic framework for different levels of political will. Reveals how Redwood targets the pessimistic scenarios. https://blog.redwoodresearch.org/p/plans-a-b-c-and-d-for-misalignment