Theory of Change
ARG's theory of change evolved across three distinct phases, all articulated by founder Sam Bowman.
Phase 1 (Oct 2022, ARG founding): "We're making progress quickly but chaotically... the downside risk is also potentially catastrophic, and it doesn't look like we're prepared to manage that risk." The causal chain: NLP researchers with empirical skills should work on alignment-relevant problems -- debate, amplification, scalable oversight, chain-of-thought faithfulness -- producing evidence about whether these methods can bootstrap human oversight of superhuman systems. The "sandwiching paradigm" would let non-experts use AI tools to solve problems verified by experts, demonstrating that oversight scales.
Phase 2 (Sep 2024, "The Checklist"): Scalable oversight is "the central pillar" of alignment work. Three chapters track from current models through transformative AI (TAI) to the post-TAI period. Safety cases, responsible scaling policies (RSPs), automated evaluations, and external adjudication must all be in place before the transition to broadly superhuman systems. Bowman states directly: "Things are likely more urgent than they appear." He assumes TAI arrives this decade.
Phase 3 (Apr 2025, "Putting Up Bumpers"): Accept that alignment won't be solved definitively before AGI. Build many independent detection methods -- interpretability audits, behavioral red-teaming, monitoring. Iterate on failures through trial and error. "Instead of aiming to get toward the aerospace end of the spectrum -- a project that has little hope of paying out if things move quickly -- I propose that we accept that our early AGI alignment work is very likely to depend critically on trial and error, and put up as many mechanisms as possible -- the bumpers -- to help us identify and respond to those errors."
What They Do
NLP benchmarks that defined the field. Bowman co-created SNLI (2015), MultiNLI (2018), GLUE (2018), and SuperGLUE (2019) -- the standard evaluation suite for NLP research for years. 64,000+ Google Scholar citations. This benchmarking expertise is the technical foundation for all subsequent alignment evaluation work.
Scalable oversight research. The sandwiching paper (Nov 2022) tested whether non-experts using AI assistants could match expert performance. Human+model teams scored 75-78% on MMLU/QuALITY vs. 57-67% for either alone. Results were "pretty discouraging" for standard CoT oversight but encouraging for debate-style approaches. GPQA (Nov 2023, David Rein) created a graduate-level benchmark where experts reach 65%, non-experts 34%, and GPT-4 39% -- designed specifically for scalable oversight experiments. "Debating with More Persuasive LLMs Leads to More Truthful Answers" won ICML 2024 Best Paper: non-expert humans achieved 88% accuracy reading debates between expert LLMs.
Alignment evaluation at Anthropic. Sabotage evaluations (2024) introduced four novel test types. The alignment faking paper (Dec 2024) showed Claude 3 Opus can strategically fake compliance at a 12% rate, rising to 78% after reinforcement learning. The pilot sabotage risk report for Claude Opus 4 (Oct 2025) was the first frontier-lab document describable as a misalignment safety case -- 60 pages, 9 threat pathways, reviewed by METR. The Anthropic-OpenAI joint evaluation (Summer 2025) tested models cross-organizationally.
Auditing infrastructure. Alignment auditing agents (2025) autonomously detect hidden goals in models, succeeding 13-42% of the time. Petri (2025) is an open-source tool for automated behavioral red-teaming, adopted by UK AISI. AuditBench (2026) provides 56 models with implanted hidden behaviors for benchmarking auditing methods. The overt saboteur experiment (Jan 2026) validated that auditing catches overt sabotage but found automated systems alone missed 2 of 3 saboteurs.
The Inverse Scaling Prize ($250K) found 11 datasets where larger models perform worse, challenging unconditional scaling optimism. Emergent misalignment (Nature 2026) showed that narrow fine-tuning on insecure code produces broad behavioral shifts.
Key People
Sam Bowman -- PI and founder. Associate Professor at NYU, on long-term leave at Anthropic since June 2022. Leads the Cognitive Oversight subteam of Alignment Science. MATS mentor with ~50 alumni. The intellectual engine behind both ARG and Anthropic's alignment evaluation program. Career arc: computational linguistics (Stanford PhD, 2016) to NLP benchmarking (SNLI/GLUE) to alignment evaluation. EA-adjacent: endorses Giving What We Can.
Ethan Perez -- Outside advisor/co-PI. Anthropic researcher leading the adversarial robustness team. FAR AI co-founder. Co-author on almost every major Anthropic alignment paper (sleeper agents, alignment faking, sabotage evals, Constitutional AI). Forbes 30 Under 30 in AI. His ARG connection is largely nominal -- his primary affiliation and work are at Anthropic.
Notable alumni destinations: David Rein (METR, GPQA creator), Jacob Pfau (UK AISI Alignment Team research lead), Asa Cooper Stickland (UK AISI), Samuel Arnesen (OpenAI Alignment), Jason Phang (OpenAI), Julian Michael (Meta AI safety), Miles Turpin (Scale AI), Shi Feng (GWU faculty, MATS mentor). Almost all alumni went to frontier labs or AI safety organizations.
Team at peak: ~12 researchers (2022). Current: effectively dormant as a standalone entity. Bowman's FAQ states he is "not currently recruiting or supervising research students" at NYU.
Money and Incentives
No independently measurable budget. ARG is a university lab within NYU -- no separate legal entity, no 990, no public financial statements. All funding flows through NYU's general infrastructure.
Open Philanthropy funding confirmed but unquantified. Bowman states "some of our funding comes from Open Philanthropy" but no named grant to ARG exists in the grants database of Coefficient Giving (Open Philanthropy's current name). ARG was likely funded through OP's $16.6M multi-recipient pool for "AI alignment projects working with deep learning systems," so the amount that reached ARG is unknown.
Other known funding: NSF CRII grant of $175K (Bowman, 2019, for linguistics work predating ARG). Google Faculty Research Award (2017, amount unknown). PhD student Jane Pan holds an NSF GRFP and a DeepMind Fellowship. NYU professor He He's $2.08M OP grant for LLM cybersecurity is a separate project.
Financial dependency on Anthropic. Bowman and Perez both draw primary compensation from Anthropic. Bowman's research resources, compute access, publication venues, and institutional influence now flow through Anthropic, not NYU. The incentive structure is clear: Bowman's career interests are aligned with Anthropic's interests. Whether Anthropic's interests are aligned with alignment is the key question.
The structural conflict. ARG's research directly supports Anthropic's safety claims and RSP compliance. The alignment faking paper, sabotage evaluations, auditing agents, and the Bumpers strategy all serve Anthropic's public narrative that it is building safety infrastructure seriously. This is either genuine safety work or sophisticated safetywashing -- and the financial incentives don't help distinguish between the two.
What Others Say
"Wildly overconfident." LessWrong commenters on Bumpers criticized Bowman for claiming bumpers could "almost entirely mitigate the risk" of misaligned early AGI systems, calling this overconfident given that the evidence relies partly on asking models to self-report their misalignment. The selection pressure problem -- iterative testing creates optimization pressure that degrades test quality -- was highlighted as potentially fatal.
"Makes me more worried." Commenters on the Checklist stated that reading it was "evidence that AI developers are not thinking seriously about some major risks and risk factors." The document intended as reassurance instead alarmed safety-conscious readers about gaps in the plan.
MIRI's structural critique. MIRI's "So You Want to Work at a Frontier AI Lab" argues that safety work at frontier labs primarily enables continued scaling: "the primary output of frontier AI labs is research that accelerates capabilities" and it's "perniciously easy to 'safetywash' despite every intention to the contrary." The alignment-capabilities overlap means safety research often directly improves models.
Prosaic alignment skepticism. The Yudkowsky/MIRI position holds that current ML methods (RLHF, fine-tuning, scalable oversight) are fundamentally inadequate for the core alignment problem. If this view is correct, Bowman's entire research program operates within a paradigm that cannot solve the problem it targets.
Bowman's own negative results. The overt saboteur experiment (Jan 2026) found automated auditing alone caught only 1 of 3 saboteurs. AuditBench identified a "tool-to-agent gap" where tools effective in isolation fail when composed into agents. Bowman published "A Small Negative Result on Debate" sharing unfavorable findings on his own research agenda. This intellectual honesty is notable.
What's Absent
ARG's current operational status has never been formally announced. The group appears effectively dormant since Bowman's long-term Anthropic leave, but there is no public statement confirming or denying this. The homepage contains 103 words and no publications list.
No specific funding amount for ARG is publicly available despite confirmed OP support. No independent evaluation of ARG's research impact exists. No formal connection between ARG and peer academic alignment labs (MIT AAG, CHAI) has been documented. Mengye Ren, listed as collaborating PI, has no visible alignment publications.
There is no long-form podcast interview with Bowman about his current views (no 80K Hours episode). The most candid extended public remarks are the FAR AI talk (Dec 2025) and the CS224U podcast (Feb 2023, before the Checklist and Bumpers).
Recommended Reading
FAR AI Talk: "Lessons Learned from the First Misalignment Safety Case" (Dec 2025) -- Most candid view of how Bowman thinks about alignment. Admits the traditional capability-alignment-control framework doesn't work. Describes practical obstacles in the first sabotage risk report. https://far.ai/events/sessions/sam-bowman-lessons-learned-from-the-first-misalignment-safety-case
MIRI: "So You Want to Work at a Frontier AI Lab" (Jun 2025) -- The strongest counterargument. Argues that safety work at labs enables scaling more than it constrains it, and that safetywashing is "perniciously easy." https://intelligence.org/2025/06/11/so-you-want-to-work-at-a-frontier-ai-lab/
"Putting Up Bumpers" (Apr 2025) -- Bowman's most mature alignment strategy. Essential for understanding both the plan and its acknowledged limitations. https://alignment.anthropic.com/2025/bumpers/
"Why I Think More NLP Researchers Should Engage with AI Safety Concerns" (Oct 2022) -- ARG's founding manifesto. Clear, accessible, reveals Bowman's motivations and worldview. https://wp.nyu.edu/arg/why-ai-safety/
CS224U Podcast with Chris Potts (Feb 2023) -- Bowman's intellectual autobiography from SNLI to alignment. Includes nuanced discussion of the alignment/capabilities distinction and the cultural tensions in the field. https://web.stanford.edu/class/cs224u/podcast/bowman/