MIT Algorithmic Alignment Group

Research

Hadfield-Menell. CIRL. Academic.

Founded: 2021
HQ: Cambridge, MA
Team: 25
Structure: university-affiliated
Model: Grants

Theory of Change

Dylan Hadfield-Menell's stated theory of change has two parts. The first, from his foundational work, is that assistance games provide an analytical framework for understanding value alignment -- they "identify areas of research that are valuable" and "shape rough abstract ideas of what a safe interaction with advanced systems would look like." He is explicit that this work "will not meaningfully reduce x-risk on its own" because "the types of Bayesian inference that we can do right now are not up to par with systems that would be effective enough to pose a threat."

The second part, more recent, comes from the group's most prominent PhD student, Stephen Casper, who argues that AI safety is "a neverending institutional challenge" analogous to nuclear risk: "The technical 'nuclear alignment' problem is virtually solved... for the foreseeable future, we will be in perpetual danger from nuclear warfare -- not for technical reasons, but institutional ones." Casper concludes that "the core priorities of the AI safety community should be to build strong institutions, increase transparency and accountability, and improve incident preparedness."

Hadfield-Menell (DHM below) is himself skeptical of an intelligence explosion: "I tend to think that the idea of an 'intelligence explosion' is a good deal more questionable than I.J. Good does. This is because I think that 'intelligence' is a complicated property with multiple competing dimensions." He views recursive improvement as occurring across organizations, not within a single entity.

The group's research has visibly pivoted: from foundational assistance game theory (2016-2020, Berkeley era) toward practical auditing tools, policy engagement, and interpretability (2021-present, MIT era).

What They Do

Foundational theory (Berkeley era, 2016-2020): CIRL (NeurIPS 2016) formalized assistance games. The Off-Switch Game (IJCAI 2017) proved that uncertainty about objectives creates incentives for corrigibility. Inverse Reward Design (NeurIPS 2017) reframed reward functions as observations, not ground truth. Consequences of Misaligned AI (NeurIPS 2020) showed that optimizing an incomplete proxy objective can eventually drive true utility arbitrarily low.
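The Off-Switch Game intuition can be illustrated numerically. The following is a minimal sketch, not the paper's code: it assumes a robot whose belief over the utility U of its proposed action is represented by samples, and a perfectly rational human who lets the action proceed only when U > 0. The function name, the Gaussian belief, and the sample count are all hypothetical choices made for illustration.

```python
import random


def off_switch_values(utility_samples):
    """Expected value of each robot option, given samples from its belief over U."""
    n = len(utility_samples)
    # Option 1: act immediately, bypassing the human. Value is E[U].
    act = sum(utility_samples) / n
    # Option 2: switch itself off. Value is 0 by convention.
    switch_off = 0.0
    # Option 3: defer to the human, who (by assumption) permits the action
    # only when U > 0 and otherwise presses the off switch. Value is E[max(U, 0)].
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "switch_off": switch_off, "defer": defer}


random.seed(0)

# An uncertain robot: belief over U is spread out (illustrative Gaussian).
uncertain_belief = [random.gauss(0.5, 1.0) for _ in range(10_000)]
# A certain robot: it "knows" U = 0.5, so the human adds no information.
certain_belief = [0.5] * 10_000

v_uncertain = off_switch_values(uncertain_belief)
v_certain = off_switch_values(certain_belief)

print("uncertain:", v_uncertain)
print("certain:  ", v_certain)
```

Under these assumptions, deferring is never worse than acting or switching off, and it is strictly better when the belief puts weight on both positive and negative U. For the certain robot, deferring and acting have the same value, so the incentive to keep the human in the loop disappears. The result weakens once the human is allowed to make mistakes, which this sketch does not model.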

Auditing and evaluation (MIT era, 2023-present): "Open Problems and Fundamental Limitations of RLHF" (TMLR 2023, 250+ paper survey, Outstanding Paper Finalist) classified RLHF problems as tractable vs. fundamental. "Black-Box Access is Insufficient for Rigorous AI Audits" (FAccT 2024) argued meaningful audits require white-box access. The AI Agent Index (2025) documented safety features of 30 AI agents, finding 25/30 disclose no internal safety results. Model Tampering Attacks (TMLR 2025) introduced "generalized adversarial attacks" for more conservative safety estimates.

Policy engagement: "Pitfalls of Evidence-Based AI Policy" (ICLR 2025) drew parallels between "evidence-based AI policy" rhetoric and tobacco industry delay tactics, proposing 15 process regulations. The paper revealed that 5/17 authors of a prominent "evidence-based policy" paper had undisclosed industry affiliations. Casper co-wrote the International AI Safety Report and the Singapore Consensus alongside Bengio, Russell, and Tegmark.

Technical alignment (ongoing): Latent Adversarial Training targets persistent harmful behaviors (2024-2025). Diverse Preference Learning / Soft Preference Learning addresses RLHF-induced diversity reduction (ICLR 2025). Open-Universe Assistance Games extends the framework to natural language (2025). "Randomness, Not Representation" argued that evaluations of cultural alignment in LLMs are unreliable (FAccT 2025).

Field building: MAIA (MIT AI Alignment) is a student group that runs reading groups and research partnerships. CAMBRIA is a 3-week ML bootcamp based on the ARENA curriculum. Both are supported by CBAI (Cambridge Boston Alignment Initiative), a 501(c)(3).

DHM's December 2024 Medium post publicly criticized AI labs for packaging safety research as marketing, calling Constitutional AI evaluation "weak" and criticizing "lazy use of LLM-as-a-judge evaluations."

Key People

Dylan Hadfield-Menell -- Associate Professor, MIT EECS/CSAIL. PhD Berkeley with Stuart Russell, Anca Dragan, Pieter Abbeel. Created the assistance games framework. Schmidt AI2050 Fellow. Research now spans SAE interpretability, in-context alignment, and democratic institutions for AI oversight. Willing to publicly criticize frontier labs' evaluation standards.

Stephen Casper -- Final-year PhD student, by output arguably the group's most impactful member. 15+ papers, policy work on international safety reports, MATS mentor, Berkman Klein Fellow, FLI fellow. Frames safety as institutional rather than technical. Led the RLHF survey, black-box audits paper, AI Agent Index, and evidence-based policy critique.

Gillian Hadfield (collaborator, DHM's mother) -- Bloomberg Distinguished Professor of AI Alignment and Governance at Johns Hopkins. Former OpenAI Senior Policy Adviser (2018-2023). Co-authored Incomplete Contracting and AI Alignment with DHM.

Team size: ~25 (1 PI, 2 postdocs, 1 research scientist, 4 PhD students, 15+ master's students/alumni). First PhD graduate: Andreas Haupt (2024), now Stanford HAI postdoc.

Money and Incentives

AAG is a university lab within MIT CSAIL -- no independent legal entity, no 990, no public budget.

Known external funding:

  • Coefficient Giving/Open Phil (CG/OP): $30,000 via BERI over 3 years (2024) -- strikingly small
  • CBAI (field-building arm): $756,000 from CG/OP (2023) for MAIA support and office space
  • Schmidt Sciences AI2050 fellowship: ~$643K, estimated by splitting the $18M cohort total evenly across 28 fellows
  • FLI Vitalik Buterin Fellowship: supports Casper (amount unknown)
  • MIT base: faculty salary, named professorship (Tenenbaum Career Development Professor)
  • Federal grants (NSF, DARPA): almost certainly exist but not publicly documented

Context: Of 20 CG/OP grants to MIT totaling $37.7M, only $30K is to AAG and $756K to CBAI for MAIA. The rest ($36.9M) goes to entirely separate MIT labs: Neil Thompson ($20.3M for AI trends), Madry ($1.43M for adversarial robustness), Esvelt ($4.8M for biosecurity), Boyden ($6M for neurobiology).
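The funding figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this profile; the variable names are hypothetical, and the even split of the AI2050 cohort total is the profile's own estimating assumption, not a disclosed award amount.

```python
# All figures are in US dollars, as quoted in this profile.
total_to_mit = 37_700_000   # 20 CG/OP grants to MIT
aag_grant = 30_000          # to AAG via BERI
cbai_grant = 756_000        # to CBAI for MAIA

# Everything that went to other MIT labs.
rest = total_to_mit - aag_grant - cbai_grant

# The four other labs that the profile names explicitly.
named_other_labs = {
    "Thompson": 20_300_000,
    "Madry": 1_430_000,
    "Esvelt": 4_800_000,
    "Boyden": 6_000_000,
}
named_total = sum(named_other_labs.values())
not_itemized = rest - named_total

# AI2050 estimate: cohort total split evenly across fellows (an assumption).
ai2050_estimate = 18_000_000 / 28

# AAG's direct share of CG/OP funding to MIT.
aag_share = aag_grant / total_to_mit

print(f"rest: ${rest:,}")
print(f"named labs: ${named_total:,}; not itemized here: ${not_itemized:,}")
print(f"AI2050 estimate: ${ai2050_estimate:,.0f}")
print(f"AAG share of CG/OP-to-MIT: {aag_share:.2%}")
```

The remainder comes to $36,914,000, matching the rounded $36.9M. The four named labs account for $32.53M of it, so roughly $4.4M goes to grants this profile does not itemize. AAG's direct share of the total is about 0.08%.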

Incentive assessment: No evidence of industry consulting, compute partnerships, or commercial revenue. DHM does not have documented economic ties to frontier labs that would create incentive misalignment. The university structure provides baseline independence. The lab's willingness to criticize Anthropic's evaluation standards and call out industry researchers' conflicts of interest is consistent with genuine independence. The main risk is the standard academic incentive toward publishable results over practical safety impact.

What Others Say

Self-critique (DHM on his own framework): "Taking current AI technology and merely saying, 'I am designing my system as the solution to an assistance game of some kind,' will not meaningfully reduce x-risk on its own... the types of Bayesian inference that we can do right now are not up to par." This is remarkable honesty about the limitations of one's own core contribution.

CHAI approach limitations: The Founders Pledge review of CHAI (DHM's PhD lab) notes CIRL is "one approach within CHAI, not the whole agenda." Broader critiques include: assistance games rely on assumptions (known prior structure, rational human behavior) that break in practice; deep learning systems don't decompose into clean game-theoretic structures; and the gap between theoretical insights and deployed systems remains wide.

Industry delay parallel: Casper argues that framing alignment as primarily technical is "convenient" for companies, because what they "say will be needed to prevent AI catastrophe would potentially give them trillions of dollars and enough power to rival the democratic institutions meant to address misuse and systemic risks."

Dual-use acknowledgment (DHM): "Effective alignment with concentrated power distributions is a recipe for some pretty bad situations. Single agent value alignment on its own really just allows people to get systems to do what they want more effectively."

Limited external criticism of AAG specifically. No identified critic specifically targets AAG's approach or impact. This likely reflects that the group is small enough and its claims modest enough that it doesn't attract focused dissent.

What's Absent

  • Total lab budget -- opaque due to university structure
  • Federal grants -- NSF/DARPA funding almost certainly exists but is undocumented
  • Student placement data -- only 1 of ~20 graduates has a confirmed destination
  • Post-Casper strategy -- how the lab maintains policy influence after its most prolific student leaves
  • Frontier model access -- no documented compute partnerships or lab collaborations
  • Government advisory roles -- DHM has none, unlike peers Russell or Bengio
  • CBAI governance -- no 990 found for the 501(c)(3) managing $756K

Recommended Reading

  1. AXRP Episode 8: Assistance Games with Dylan Hadfield-Menell (2021) -- The most candid source. DHM honestly discusses what his framework can and cannot do for x-risk, admits it won't directly reduce existential risk, and explains the theoretical foundations. Start here for understanding how AAG's PI actually thinks. https://axrp.net/episode/2021/06/08/episode-8-assistance-games-dylan-hadfield-menell.html

  2. Casper, "Reframing AI Safety as a Neverending Institutional Challenge" (2025) -- The strongest counterargument to purely technical alignment approaches, from within AAG itself. The nuclear-risk analogy is compelling. https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/

  3. Pitfalls of Evidence-Based AI Policy (ICLR 2025) -- The group's boldest policy output. Draws tobacco industry parallels, names industry conflicts of interest, proposes 15 process regulations. https://iclr-blogposts.github.io/2025/blog/pitfalls-of-evidence-based-ai-policy/

  4. Stephen Casper on Technical and Sociotechnical AI Safety (CAIP Podcast, 2024) -- Best overview of Casper's views on interpretability standards, RLHF limitations, audit incentive problems, and why he's shifting from technical to sociotechnical work. https://aipolicypod.substack.com/p/10-stephen-casper-on-technical-and

Claude's Analysis
An opinionated read. Read the brief first to form your own view.

Stated Theory of Change

AAG operates on two levels. The explicit theoretical contribution is that assistance games provide an analytical framework for studying value alignment: they allow researchers to "identify areas of research that are valuable" and "shape rough abstract ideas of what a safe interaction with advanced systems would look like." DHM is careful to note these are analytical tools, not deployable solutions.

The implicit theory of change is broader: by training interdisciplinary researchers at MIT, publishing in top venues, engaging with policy, and building student pipelines through MAIA, AAG shapes how the field thinks about alignment -- moving it from purely technical optimization toward economic, legal, and institutional framings. The Incomplete Contracting paper with Gillian Hadfield, the multi-stakeholder alignment work, and the policy engagement all serve this agenda.

Casper's theory of change is somewhat at odds with the lab's core research: he argues that technical alignment solutions are insufficient and potentially counterproductive, and that safety requires building permanent institutions. This tension is productive -- it pushes the group toward policy engagement and sociotechnical work that complements the theoretical research.

Revealed Theory of Change

Actions largely match stated intentions, with an interesting evolution. The Berkeley-era work (CIRL, Off-Switch Game, IRD) was primarily theoretical and aimed at the academic AI community. The MIT-era work splits into:

  1. Practical auditing and evaluation (Casper-driven): RLHF survey, black-box audits paper, AI Agent Index, model tampering attacks. These directly serve the institutional safety agenda.
  2. Policy engagement (DHM + Casper): evidence-based policy critique, international safety reports, TechPolicy.Press articles, public criticism of lab evaluation standards.
  3. Interpretability and alignment methods (newer direction): SAE tools, in-context alignment, diverse preference learning.
  4. Field building: MAIA, CAMBRIA bootcamp, mentorship (Pivotal, MATS, ERA, GovAI).

The shift from pure theory to practical tools + policy is genuine and visible. The group appears to have updated on the limitations of foundational theory and pivoted toward having impact through auditing infrastructure and governance.

One notable gap: despite Casper's institutional framing, the group hasn't built or proposed specific institutional structures. The "Pitfalls" paper proposes 15 regulatory goals but doesn't establish or design the implementing institutions. This is idea generation, not institution building.

Key Assumptions

A1: Analytical frameworks transfer to practice. The assumption that theoretical insights about assistance games, inverse reward design, and incomplete contracting will eventually inform how real systems are built. Evidence for: some concepts (uncertainty about objectives, reward function as observation) have influenced how the field thinks. Evidence against: no AAG formalism has been directly implemented in a deployed system at scale.

A2: Academic influence on policy matters. The assumption that publishing in FAccT, writing international safety reports, and critiquing evidence-based policy will translate to better regulation. Evidence for: the AI Agent Index generated significant media coverage; Casper is embedded in international policy processes. Evidence against: regulation has mostly progressed independently of academic alignment research; the ICLR policy blog hasn't obviously influenced any legislation.

A3: Multi-stakeholder alignment is tractable. DHM's argument that aligning AI with a single actor is dangerous and multi-stakeholder alignment is needed. This is a harder problem than single-principal alignment. If multi-stakeholder alignment is intractable at scale, the theoretical agenda hits a dead end. Testable: does the group produce algorithms that actually aggregate diverse preferences better than existing methods?

A4: The pipeline matters more than the papers. If AAG's graduates carry alignment-aware thinking into frontier labs, policy positions, and other research groups, the indirect impact through people may exceed the direct impact of publications. Currently untestable: student placement data is almost entirely missing.

A5: Slow takeoff and distributed AI development. DHM's skepticism about intelligence explosion means he expects AI risk to unfold gradually, making institutional solutions viable. If sharp takeoff occurs, AAG's long-term institution-building approach may be too slow. This is a deep worldview bet that shapes everything.

Strengths

Intellectual honesty. DHM's candor about his own framework's limitations -- "will not meaningfully reduce x-risk on its own" -- is rare in alignment research. Casper's willingness to say labs use alignment framing to justify power concentration is equally unusual. This honesty makes their analysis more credible.

Independence from frontier labs. No documented industry consulting, compute partnerships, or financial ties that would compromise judgment. DHM's public criticism of Anthropic's evaluation standards and Casper's critique of lab alignment narratives demonstrate real independence.

Interdisciplinary breadth. The team spans philosophy, economics, law, policy, and ML. The Incomplete Contracting paper, the multi-stakeholder alignment framing, and the parallels to tobacco industry regulation all reflect intellectual range beyond the typical CS-only alignment lab.

Practical policy products. The AI Agent Index, black-box audits paper, and ICLR policy blog are concrete contributions to AI governance infrastructure. These bridge theory and policy more effectively than most academic outputs.

Institutional positioning. MIT's brand legitimacy gives AAG's outputs credibility with policymakers and mainstream AI researchers who might dismiss work from less prestigious institutions.

Casper as bridge figure. Working simultaneously on technical safety (LAT, adversarial training), policy (international reports), and governance (audit standards) -- an unusual combination that connects otherwise siloed communities.

Weaknesses and Risks

Casper dependency. Much of AAG's visible impact -- the RLHF survey, AI Agent Index, policy engagement, international safety documents -- flows through a single PhD student. Post-Casper, the group's policy influence may sharply contract.

Theory-practice gap. No AAG formalism has been implemented in a deployed system. The assistance games framework is analytically useful but has not produced engineering tools that practitioners use. The pivot toward interpretability and auditing tools may address this, but it's early.

Small funding footprint. The $30K CG/OP grant is strikingly small. Whether this reflects CG/OP's assessment of AAG's impact or simply a mismatch between funding channels, it suggests the group is not viewed as a priority investment by the largest AI safety funder.

Missing student data. Without knowing where graduates go, we cannot assess the pipeline's effectiveness. If MIT master's students who rotated through AAG end up at frontier labs doing capabilities work, the net effect could be negative.

No frontier access. Academic compute constraints and no documented lab partnerships mean AAG's technical work may not generalize to frontier-scale systems. LAT and model tampering are demonstrated on smaller models.

Institutional proposals without institutional construction. Casper argues for building institutions but hasn't built one. The group critiques existing evaluation standards without establishing alternative institutions. There is a gap between diagnosis and implementation.

Cross-References

CHAI (Berkeley): AAG is CHAI's intellectual descendant. DHM trained under Russell, Dragan, Abbeel. The core research program (assistance games) originated at CHAI. AAG differs by (a) more emphasis on multi-stakeholder alignment, (b) stronger policy engagement, (c) less emphasis on human-robot interaction experiments. CHAI's more direct lab experiments complement AAG's more theoretical approach.

Frontier labs: AAG occupies a useful adversarial position -- willing to publicly critique labs' evaluation standards and framing. This is valuable precisely because many academic groups self-censor to maintain lab relationships.

Casper's broader network: Through MATS, ERA, GovAI mentorship, international safety reports, and UK AISI experience, Casper connects AAG to a much broader institutional network than a typical academic lab.

CBAI ecosystem: Through CBAI, AAG is connected to Max Tegmark (FLI co-founder), Harvard AISST, and the ARENA bootcamp community.

What Would Change This Assessment

  • Evidence that AAG graduates systematically enter alignment-focused positions would significantly upgrade the pipeline theory of change.
  • A deployed system using AAG-developed methods (assistance game formalism, LAT, auditing tools) would demonstrate practical impact.
  • Post-Casper policy engagement maintained at current levels would show the institutional capacity isn't person-dependent.
  • DHM taking a formal advisory role with a government AI safety body or frontier lab would bridge the theory-practice gap.
  • A focused, well-argued critique of AAG's specific approach (not just general CHAI-style concerns) would help calibrate the assessment.
  • Evidence of significant undocumented funding (large DARPA grants, industry consulting) would change the independence assessment.

Self-Critique

Weakest claim: My assessment that AAG is genuinely independent from frontier labs rests partly on absence of evidence (no documented industry ties) rather than evidence of absence. There may be informal consulting arrangements, advisory boards, or compute-sharing agreements that don't appear in public sources.

Potential bias: I may be giving AAG credit for honesty about limitations that actually reflects a lack of ambition. An org that says "our work won't directly reduce x-risk" could be praised for candor or criticized for not aiming higher. My framing leans toward the former.

What I'd want a critic to check: Whether DHM's federal grant portfolio (invisible to me) reveals priorities that diverge from the safety-focused work visible in publications. NSF grants often require broader impact statements that may tell a different story about the lab's priorities.

Missing perspective: I haven't found a critic who specifically targets AAG's approach. The critiques I've identified are either self-generated or directed at the broader CHAI tradition. A thoughtful outsider critique of AAG specifically would be valuable for calibration.

Information I most want: (1) Complete list of AAG graduate career destinations. (2) DHM's full grant portfolio. (3) CBAI's 990 filing. (4) Any documentation of informal lab partnerships or computing arrangements.

Connected to (12)

Berkman Klein Center -- collaborator · Stephen Casper
Cambridge Boston Alignment Initiative -- collaborator
Center for Human-Compatible AI -- staff from · Dylan Hadfield-Menell
Center for Human-Compatible AI -- staff from · Stephen Casper
Centre for the Governance of AI -- advisor at · Stephen Casper
ERA Cambridge -- advisor at · Stephen Casper
Google DeepMind -- staff to
MATS -- advisor at · Stephen Casper
Pivotal Research -- advisor at · Dylan Hadfield-Menell
Stanford HAI -- staff to · Andreas Haupt
Gillian Hadfield / Johns Hopkins -- collaborator · Dylan Hadfield-Menell
UK AI Safety Institute -- staff from · Stephen Casper
Sources (61)
Every URL that was read during research.
  1. Algorithmic Alignment Group (algorithmicalignment.csail.mit.edu)
  2. Dylan Hadfield-Menell (people.csail.mit.edu)
  3. Dylan Hadfield-Menell - AI2050 (ai2050.schmidtsciences.org)
  4. Dylan Hadfield-Menell - Cambridge Boston Alignment Initiative (cbai.ai)
  5. We're a group of MIT students conducting research to reduce risk from advanced AI. (aialignment.mit.edu)
  6. Community Perspective - Dylan Hadfield-Menell - AI2050 (ai2050.schmidtsciences.org)
  7. Dylan Hadfield-Menell, UC Berkeley/MIT: On the value alignment problem in AI (imbue.com)
  8. 8 - Assistance Games with Dylan Hadfield-Menell (axrp.net)
  9. Stephen Casper (stephencasper.com)
  10. Cambridge Boston Alignment Initiative (cbai.ai)
  11. #10: Stephen Casper on Technical and Sociotechnical AI Safety Research (aipolicypod.substack.com)
  12. Research - Cambridge Boston Alignment Initiative (cbai.ai)
  13. Dylan Hadfield-Menell, Author at TechPolicy.Press (techpolicy.press)
  14. Red Teaming AI: The Devil Is In The Details (techpolicy.press)
  15. Dylan Hadfield-Menell - Pivotal (pivotal-research.org)
  16. Dylan Hadfield-Menell | MIT Sloan Executive Education (executive.mit.edu)
  17. CAMBRIA - Cambridge Boston Alignment Initiative (cbai.ai)
  18. SRI Seminar Series: Dylan Hadfield-Menell, “You can’t have AI safety without inclusion” - Schwartz Reisman Institute (srinstitute.utoronto.ca)
  19. Toward a More Expansive Perspective on AI Safety (ethicsinsociety.stanford.edu)
  20. Unknown (people.csail.mit.edu)
  21. Dylan Hadfield-Menell - Future of Life Institute (futureoflife.org)
  22. Algorithmic Alignment Group (github.com)
  23. What is the MIT Algorithmic Alignment Group's research agenda? (stampy.ai)
  24. Dylan Hadfield-Menell (eecs.mit.edu)
  25. Dylan Hadfield-Menell (engineering.mit.edu)
  26. Connor Coley, Dylan Hadfield-Menell named AI2050 Early Career Fellows (eecs.mit.edu)
  27. MIT AI Alignment (mitalignment.org)
  28. Understanding the Effects of RLHF on LLM Generalisation and Diversity (arxiv.org)
  29. Incomplete Contracting and AI Alignment (arxiv.org)
  30. Black-Box Access is Insufficient for Rigorous AI Audits (arxiv.org)
  31. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (arxiv.org)
  32. Defending Against Unforeseen Failure Modes with Latent Adversarial Training (arxiv.org)
  33. Gillian Hadfield - Wikipedia (en.wikipedia.org)
  34. Max Tegmark - Cambridge Boston Alignment Initiative (cbai.ai)
  35. Center for Human-Compatible AI (founderspledge.com)
  36. Stephen Casper (cyber.harvard.edu)
  37. Consequences of Misaligned AI (arxiv.org)
  38. Dylan Hadfield-Menell (knightcolumbia.org)
  39. Stephen Casper (Cas) at MATS: Summer 2026 (matsprogram.org)
  40. Reframing AI Safety as a Neverending Institutional Challenge (stephencasper.com)
  41. Pitfalls of Evidence-Based AI Policy (arxiv.org)
  42. Cooperative Inverse Reinforcement Learning (arxiv.org)
  43. Pitfalls of Evidence-Based AI Policy (iclr-blogposts.github.io)
  44. Smokescreen: How Bad Evidence Is Used to Prevent AI Safety (ai-frontiers.org)
  45. The 2025 AI Agent Index (aiagentindex.mit.edu)
  46. Algorithmic Alignment Group (algorithmicalignment.csail.mit.edu)
  47. Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities (arxiv.org)
  48. Legible Normativity for AI Alignment: The Value of Silly Rules (arxiv.org)
  49. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (arxiv.org)
  50. AI Alignment Podcast: Cooperative Inverse Reinforcement Learning with Dylan Hadfield-Menell (Beneficial AGI 2019) - Future of Life Institute (futureoflife.org)
  51. Programs - Cambridge Boston Alignment Initiative (cbai.ai)
  52. Inverse Reward Design (arxiv.org)
  53. Diverse Preference Learning for Capabilities and Alignment (arxiv.org)
  54. Get Involved (aialignment.mit.edu)
  55. CHAI PhD Students Accept Positions at MIT, Princeton, and DeepMind (humancompatible.ai)
  56. Dylan Hadfield-Menell (computerhistory.org)
  57. Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs (arxiv.org)
  58. Dylan Hadfield-Menell (approximatelycorrect.com)
  59. Open-Universe Assistance Games (arxiv.org)
  60. MIT AI Risk Repository (airisk.mit.edu)
  61. The Need for Scientific Rigor in AI Safety Research (medium.com)