Theory of Change
Dylan Hadfield-Menell's stated theory of change has two parts. The first, from his foundational work, is that assistance games provide an analytical framework for understanding value alignment -- they "identify areas of research that are valuable" and "shape rough abstract ideas of what a safe interaction with advanced systems would look like." He is explicit that this work "will not meaningfully reduce x-risk on its own" because "the types of Bayesian inference that we can do right now are not up to par with systems that would be effective enough to pose a threat."
The second part, more recent, comes from the group's most prominent PhD student, Stephen Casper, who argues that AI safety is "a neverending institutional challenge" analogous to nuclear risk: "The technical 'nuclear alignment' problem is virtually solved... for the foreseeable future, we will be in perpetual danger from nuclear warfare -- not for technical reasons, but institutional ones." Casper concludes that "the core priorities of the AI safety community should be to build strong institutions, increase transparency and accountability, and improve incident preparedness."
Hadfield-Menell (hereafter DHM) is himself skeptical of an intelligence explosion: "I tend to think that the idea of an 'intelligence explosion' is a good deal more questionable than I.J. Good does. This is because I think that 'intelligence' is a complicated property with multiple competing dimensions." He views recursive improvement as something that occurs across organizations, not within a single entity.
The group's research has visibly pivoted: from foundational assistance game theory (2016-2020, Berkeley era) toward practical auditing tools, policy engagement, and interpretability (2021-present, MIT era).
What They Do
Foundational theory (Berkeley era, 2016-2020): Cooperative Inverse Reinforcement Learning (CIRL, NeurIPS 2016) formalized assistance games. The Off-Switch Game (IJCAI 2017) proved that uncertainty about objectives creates incentives for corrigibility. Inverse Reward Design (NeurIPS 2017) reframed reward functions as observations, not ground truth. Consequences of Misaligned AI (NeurIPS 2020) showed that optimizing an incomplete proxy objective can eventually drive true utility arbitrarily low.
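The Off-Switch Game result can be illustrated with a small numerical sketch. This is a toy version under stated assumptions, not the paper's code: the robot holds a belief over the utility U of a proposed action, the human is assumed perfectly rational (permitting the action only when U > 0), and the Gaussian belief and the function name `off_switch_values` are illustrative choices that do not come from the source.

```python
import random


def off_switch_values(utility_samples):
    """Expected value of the robot's three options under its belief about U.

    utility_samples: draws from the robot's belief over the action's utility.
    Assumes a perfectly rational human who permits the action only if U > 0.
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n  # act without asking: E[U]
    off = 0.0  # switch itself off: utility 0
    # Defer to the human: the action goes ahead only in the cases where U > 0.
    defer = sum(max(u, 0.0) for u in utility_samples) / n  # E[max(U, 0)]
    return {"act": act, "off": off, "defer": defer}


def incentive_to_defer(values):
    """How much better deferring is than the best unilateral option."""
    return values["defer"] - max(values["act"], values["off"])


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical belief: the action is probably good, but the robot is unsure.
uncertain_belief = [rng.gauss(0.2, 1.0) for _ in range(10_000)]
# Same expected utility, but no uncertainty at all.
certain_belief = [0.2] * 10_000

incentive_uncertain = incentive_to_defer(off_switch_values(uncertain_belief))
incentive_certain = incentive_to_defer(off_switch_values(certain_belief))

print(f"incentive to defer (uncertain robot): {incentive_uncertain:.3f}")
print(f"incentive to defer (certain robot):   {incentive_certain:.3f}")
```

Under these assumptions the uncertain robot strictly prefers to leave the off switch in the human's hands, while the certain robot gains nothing from deferring. Relaxing the rational-human assumption shrinks the incentive.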
Auditing and evaluation (MIT era, 2023-present): "Open Problems and Fundamental Limitations of RLHF" (TMLR 2023, 250+ paper survey, Outstanding Paper Finalist) classified RLHF problems as tractable vs. fundamental. "Black-Box Access is Insufficient for Rigorous AI Audits" (FAccT 2024) argued meaningful audits require white-box access. The AI Agent Index (2025) documented safety features of 30 AI agents, finding 25/30 disclose no internal safety results. Model Tampering Attacks (TMLR 2025) introduced "generalized adversarial attacks" for more conservative safety estimates.
Policy engagement: "Pitfalls of Evidence-Based AI Policy" (ICLR 2025) drew parallels between "evidence-based AI policy" rhetoric and tobacco industry delay tactics, proposing 15 process regulations. The paper revealed that 5/17 authors of a prominent "evidence-based policy" paper had undisclosed industry affiliations. Casper co-wrote the International AI Safety Report and the Singapore Consensus alongside Bengio, Russell, and Tegmark.
Technical alignment (ongoing): Latent Adversarial Training targets persistent harmful behaviors (2024-2025). Diverse Preference Learning / Soft Preference Learning addresses the reduction in output diversity induced by RLHF (ICLR 2025). Open-Universe Assistance Games extends the framework to natural language (2025). "Randomness, Not Representation" challenged the reliability of cultural alignment evaluations (FAccT 2025).
Field building: The MAIA (MIT AI Alignment) student group runs reading groups and research partnerships. CAMBRIA is a 3-week ML bootcamp based on the ARENA curriculum. Both are supported by the Cambridge Boston Alignment Initiative (CBAI), a 501(c)(3).
DHM's December 2024 Medium post publicly criticized AI labs for packaging safety research as marketing, calling Constitutional AI evaluation "weak" and criticizing "lazy use of LLM-as-a-judge evaluations."
Key People
Dylan Hadfield-Menell -- Associate Professor, MIT EECS/CSAIL. PhD Berkeley with Stuart Russell, Anca Dragan, Pieter Abbeel. Created the assistance games framework. Schmidt AI2050 Fellow. Research now spans SAE interpretability, in-context alignment, and democratic institutions for AI oversight. Willing to publicly criticize frontier labs' evaluation standards.
Stephen Casper -- Final-year PhD student, by output arguably the group's most impactful member. 15+ papers, policy work on international safety reports, MATS mentor, Berkman Klein Fellow, FLI fellow. Frames safety as institutional rather than technical. Led the RLHF survey, black-box audits paper, AI Agent Index, and evidence-based policy critique.
Gillian Hadfield (collaborator, DHM's mother) -- Bloomberg Distinguished Professor of AI Alignment and Governance at Johns Hopkins. Former OpenAI Senior Policy Adviser (2018-2023). Co-authored Incomplete Contracting and AI Alignment with DHM.
Team size: ~25 (1 PI, 2 postdocs, 1 research scientist, 4 PhD students, 15+ master's students/alumni). First PhD graduate: Andreas Haupt (2024), now Stanford HAI postdoc.
Money and Incentives
The Algorithmic Alignment Group (AAG) is a university lab within MIT CSAIL -- no independent legal entity, no 990, no public budget.
Known external funding:
- Coefficient Giving/Open Phil (CG/OP): $30,000 via BERI over 3 years (2024) -- strikingly small
- CBAI (field-building arm): $756,000 from CG/OP (2023) for MAIA support and office space
- Schmidt Sciences AI2050 fellowship: ~$643K estimated ($18M cohort total / 28 fellows)
- FLI Vitalik Buterin Fellowship: supports Casper (amount unknown)
- MIT base: faculty salary, named professorship (Tenenbaum Career Development Professor)
- Federal grants (NSF, DARPA): almost certainly exist but not publicly documented
Context: Of 20 CG/OP grants to MIT totaling $37.7M, only $30K went to AAG and $756K to CBAI for MAIA. The rest ($36.9M) went to entirely separate MIT labs, including Neil Thompson ($20.3M for AI trends), Madry ($1.43M for adversarial robustness), Esvelt ($4.8M for biosecurity), and Boyden ($6M for neurobiology).
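The funding arithmetic above can be checked with a short script. It uses only the figures quoted in this section; the variable names are illustrative, the equal split across Schmidt fellows is an assumption behind the estimate (not a reported award), and the $37.7M total is rounded, so the derived numbers are approximate.

```python
# All amounts in US dollars, copied from the figures quoted in the text.
schmidt_cohort_total = 18_000_000
schmidt_fellows = 28
# Assumption: the cohort total is split evenly across fellows.
per_fellow_estimate = schmidt_cohort_total / schmidt_fellows

mit_total = 37_700_000   # 20 CG/OP grants to MIT (rounded figure)
aag_grant = 30_000       # to AAG via BERI
cbai_grant = 756_000     # to CBAI for MAIA

# Everything that did not go to AAG or CBAI.
rest = mit_total - aag_grant - cbai_grant

# The four separate labs named in the text.
named_labs = {
    "Thompson (AI trends)": 20_300_000,
    "Madry (adversarial robustness)": 1_430_000,
    "Esvelt (biosecurity)": 4_800_000,
    "Boyden (neurobiology)": 6_000_000,
}
named_total = sum(named_labs.values())

# Portion of the rest that the text does not itemize.
unitemized = rest - named_total

# AAG's direct share of all CG/OP funding to MIT.
aag_share = aag_grant / mit_total

print(f"Schmidt per-fellow estimate: ${per_fellow_estimate:,.0f}")
print(f"Rest of CG/OP funding to MIT: ${rest:,}")
print(f"Named labs total: ${named_total:,}")
print(f"Not itemized: ${unitemized:,}")
print(f"AAG share of MIT total: {aag_share:.2%}")
```

On these figures the per-fellow estimate is about $643K and the rest is about $36.9M. The four named labs account for about $32.5M of that, leaving roughly $4.4M that is not itemized here, and AAG's direct share of the MIT total is under 0.1%.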
Incentive assessment: No evidence of industry consulting, compute partnerships, or commercial revenue. DHM does not have documented economic ties to frontier labs that would create incentive misalignment. The university structure provides baseline independence. The lab's willingness to criticize Anthropic's evaluation standards and call out industry researchers' conflicts of interest is consistent with genuine independence. The main risk is the standard academic incentive toward publishable results over practical safety impact.
What Others Say
Self-critique (DHM on his own framework): "Taking current AI technology and merely saying, 'I am designing my system as the solution to an assistance game of some kind,' will not meaningfully reduce x-risk on its own... the types of Bayesian inference that we can do right now are not up to par." This is remarkable honesty about the limitations of one's own core contribution.
CHAI approach limitations: The Founders Pledge review of CHAI (DHM's PhD lab) notes CIRL is "one approach within CHAI, not the whole agenda." Broader critiques include: assistance games rely on assumptions (known prior structure, rational human behavior) that break in practice; deep learning systems don't decompose into clean game-theoretic structures; and the gap between theoretical insights and deployed systems remains wide.
Industry delay parallel: Casper argues that framing alignment as primarily technical is "convenient" for the labs, because what companies "say will be needed to prevent AI catastrophe would potentially give them trillions of dollars and enough power to rival the democratic institutions meant to address misuse and systemic risks."
Dual-use acknowledgment (DHM): "Effective alignment with concentrated power distributions is a recipe for some pretty bad situations. Single agent value alignment on its own really just allows people to get systems to do what they want more effectively."
Limited external criticism of AAG specifically: no identified critic targets AAG's approach or impact. This likely reflects that the group is small enough, and its claims modest enough, that it doesn't attract focused dissent.
What's Absent
- Total lab budget -- opaque due to university structure
- Federal grants -- NSF/DARPA funding almost certainly exists but is undocumented
- Student placement data -- only 1 of ~20 graduates has a confirmed destination
- Post-Casper strategy -- how the lab maintains policy influence after its most prolific student leaves
- Frontier model access -- no documented compute partnerships or lab collaborations
- Government advisory roles -- DHM has none, unlike peers Russell or Bengio
- CBAI governance -- no 990 found for the 501(c)(3) managing $756K
Recommended Reading
AXRP Episode 8: Assistance Games with Dylan Hadfield-Menell (2021) -- The most candid source. DHM honestly discusses what his framework can and cannot do for x-risk, admits it won't directly reduce existential risk, and explains the theoretical foundations. Start here for understanding how AAG's PI actually thinks. https://axrp.net/episode/2021/06/08/episode-8-assistance-games-dylan-hadfield-menell.html
Casper, "Reframing AI Safety as a Neverending Institutional Challenge" (2025) -- The strongest counterargument to purely technical alignment approaches, from within AAG itself. The nuclear-risk analogy is compelling. https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/
Pitfalls of Evidence-Based AI Policy (ICLR 2025) -- The group's boldest policy output. Draws tobacco industry parallels, names industry conflicts of interest, proposes 15 process regulations. https://iclr-blogposts.github.io/2025/blog/pitfalls-of-evidence-based-ai-policy/
Stephen Casper on Technical and Sociotechnical AI Safety (CAIP Podcast, 2024) -- Best overview of Casper's views on interpretability standards, RLHF limitations, audit incentive problems, and why he's shifting from technical to sociotechnical work. https://aipolicypod.substack.com/p/10-stephen-casper-on-technical-and