NYU Alignment Research Group (ARG)

Research

Sam Bowman. Scalable oversight.

Founded: 2022
HQ: New York, NY
Team: 3
Structure: university-affiliated
Model: Grants

Theory of Change

ARG's theory of change evolved across three distinct phases, all articulated by founder Sam Bowman.

Phase 1 (Oct 2022, ARG founding): "We're making progress quickly but chaotically... the downside risk is also potentially catastrophic, and it doesn't look like we're prepared to manage that risk." The causal chain: NLP researchers with empirical skills should work on alignment-relevant problems -- debate, amplification, scalable oversight, chain-of-thought faithfulness -- producing evidence about whether these methods can bootstrap human oversight of superhuman systems. The "sandwiching paradigm" would let non-experts use AI tools to solve problems verified by experts, demonstrating that oversight scales.

Phase 2 (Sep 2024, "The Checklist"): Scalable oversight is "the central pillar" of alignment work. Three chapters track from current models through TAI to post-TAI. Safety cases, RSPs, automated evaluations, and external adjudication must all be in place before the transition to broadly superhuman systems. Bowman states directly: "Things are likely more urgent than they appear." He assumes TAI this decade.

Phase 3 (Apr 2025, "Putting Up Bumpers"): Accept that alignment won't be solved definitively before AGI. Build many independent detection methods -- interpretability audits, behavioral red-teaming, monitoring. Iterate on failures through trial and error. "Instead of aiming to get toward the aerospace end of the spectrum -- a project that has little hope of paying out if things move quickly -- I propose that we accept that our early AGI alignment work is very likely to depend critically on trial and error, and put up as many mechanisms as possible -- the bumpers -- to help us identify and respond to those errors."

What They Do

NLP benchmarks that defined the field. Bowman co-created SNLI (2015), MultiNLI (2018), GLUE (2018), and SuperGLUE (2019) -- the standard evaluation suite for NLP research for years. 64,000+ Google Scholar citations. This benchmarking expertise is the technical foundation for all subsequent alignment evaluation work.

Scalable oversight research. The sandwiching paper (Nov 2022) tested whether non-experts using AI assistants could match expert performance. Human+model teams scored 75-78% on MMLU/QuALITY vs. 57-67% for either alone. Results were "pretty discouraging" for standard CoT oversight but encouraging for debate-style approaches. GPQA (Nov 2023, David Rein) created a graduate-level benchmark where experts reach 65%, non-experts 34%, and GPT-4 39% -- designed specifically for scalable oversight experiments. "Debating with More Persuasive LLMs Leads to More Truthful Answers" won ICML 2024 Best Paper: non-expert humans achieved 88% accuracy reading debates between expert LLMs.

Alignment evaluation at Anthropic. Sabotage evaluations (2024) introduced four novel test types. The alignment faking paper (Dec 2024) showed Claude 3 Opus can strategically fake compliance at a 12% rate, rising to 78% after reinforcement learning. The pilot sabotage risk report for Claude Opus 4 (Oct 2025) was the first frontier-lab document describable as a misalignment safety case -- 60 pages, 9 threat pathways, reviewed by METR. The Anthropic-OpenAI joint evaluation (Summer 2025) tested models cross-organizationally.

Auditing infrastructure. Alignment auditing agents (2025) autonomously detect hidden goals in models, succeeding 13-42% of the time. Petri (2025) is an open-source tool for automated behavioral red-teaming, adopted by UK AISI. AuditBench (2026) provides 56 models with implanted hidden behaviors for benchmarking auditing methods. The overt saboteur experiment (Jan 2026) validated that auditing catches overt sabotage but found automated systems alone missed 2 of 3 saboteurs.

The Inverse Scaling Prize ($250K) found 11 datasets where larger models perform worse, challenging unconditional scaling optimism. Emergent misalignment (Nature 2026): narrow fine-tuning on insecure code produces broad behavioral shifts.

Key People

Sam Bowman -- PI and founder. Associate Professor at NYU, on long-term leave at Anthropic since June 2022. Leads Cognitive Oversight subteam of Alignment Science. MATS mentor with ~50 alumni. The intellectual engine behind both ARG and Anthropic's alignment evaluation program. Career arc: computational linguistics (Stanford PhD, 2016) to NLP benchmarking (SNLI/GLUE) to alignment evaluation. EA-adjacent: endorses Giving What We Can.

Ethan Perez -- Outside advisor/co-PI. Anthropic researcher leading adversarial robustness team. FAR AI co-founder. Co-author on almost every major Anthropic alignment paper (sleeper agents, alignment faking, sabotage evals, Constitutional AI). Forbes 30 Under 30 in AI. ARG connection is largely nominal -- his primary affiliation and work are at Anthropic.

Notable alumni destinations: David Rein (METR, GPQA creator), Jacob Pfau (UK AISI Alignment Team research lead), Asa Cooper Stickland (UK AISI), Samuel Arnesen (OpenAI Alignment), Jason Phang (OpenAI), Julian Michael (Meta AI safety), Miles Turpin (Scale AI), Shi Feng (GWU faculty, MATS mentor). Almost all alumni went to frontier labs or AI safety organizations.

Team at peak: ~12 researchers (2022). Current: effectively dormant as standalone entity. Bowman's FAQ states he is "not currently recruiting or supervising research students" at NYU.

Money and Incentives

No independently measurable budget. ARG is a university lab within NYU -- no separate legal entity, no 990, no public financial statements. All funding flows through NYU's general infrastructure.

Open Philanthropy funding confirmed but unquantified. Bowman states "some of our funding comes from Open Philanthropy" but no named grant to ARG exists in the grants database of Coefficient Giving (Open Philanthropy's current name). ARG was likely funded through OP's $16.6M multi-recipient pool for "AI alignment projects working with deep learning systems." The amount ARG received is unknown.

Other known funding: NSF CRII grant of $175K (Bowman, 2019, for linguistics work predating ARG). Google Faculty Research Award (2017, amount unknown). PhD student Jane Pan has NSF GRFP + DeepMind Fellowship. The $2.08M OP grant to He He for LLM cybersecurity is a separate project.

Financial dependency on Anthropic. Bowman and Perez both draw primary compensation from Anthropic. Bowman's research resources, compute access, publication venues, and institutional influence now flow through Anthropic, not NYU. The incentive structure is clear: Bowman's career interests are aligned with Anthropic's interests. Whether Anthropic's interests are aligned with alignment is the key question.

The structural conflict. ARG's research directly supports Anthropic's safety claims and RSP compliance. The alignment faking paper, sabotage evaluations, auditing agents, and the Bumpers strategy all serve Anthropic's public narrative that it is building safety infrastructure seriously. This is either genuine safety work or sophisticated safetywashing -- and the financial incentives don't help distinguish between the two.

What Others Say

"Wildly overconfident." LessWrong commenters on Bumpers criticized Bowman for claiming bumpers could "almost entirely mitigate the risk" of misaligned early AGI systems, calling this overconfident given that the evidence relies partly on asking models to self-report their misalignment. The selection pressure problem -- iterative testing creates optimization pressure that degrades test quality -- was highlighted as potentially fatal.

"Makes me more worried." Commenters on the Checklist stated that reading it was "evidence that AI developers are not thinking seriously about some major risks and risk factors." The document intended as reassurance instead alarmed safety-conscious readers about gaps in the plan.

MIRI's structural critique. MIRI's "So You Want to Work at a Frontier Lab" argues that safety work at frontier labs primarily enables continued scaling: "the primary output of frontier AI labs is research that accelerates capabilities" and it's "perniciously easy to 'safetywash' despite every intention to the contrary." The alignment-capabilities overlap means safety research often directly improves models.

Prosaic alignment skepticism. The Yudkowsky/MIRI position holds that current ML methods (RLHF, fine-tuning, scalable oversight) are fundamentally inadequate for the core alignment problem. If this view is correct, Bowman's entire research program operates within a paradigm that cannot solve the problem it targets.

Bowman's own negative results. The overt saboteur experiment (Jan 2026) found automated auditing alone caught only 1 of 3 saboteurs. AuditBench identified a "tool-to-agent gap" where tools effective in isolation fail when composed into agents. Bowman published "A Small Negative Result on Debate" sharing unfavorable findings on his own research agenda. This intellectual honesty is notable.

What's Absent

ARG's current operational status has never been formally announced. The group appears effectively dormant since Bowman's long-term Anthropic leave, but there is no public statement confirming or denying this. The homepage contains 103 words and no publications list.

No specific funding amount for ARG is publicly available despite confirmed OP support. No independent evaluation of ARG's research impact exists. No formal connection between ARG and peer academic alignment labs (MIT AAG, CHAI) has been documented. Mengye Ren, listed as collaborating PI, has no visible alignment publications.

There is no long-form podcast interview with Bowman about his current views (no 80K Hours episode). The most candid extended public remarks are the FAR AI talk (Dec 2025) and the CS224U podcast (Feb 2023, pre-ARG).

Stated Theory of Change

ARG's stated theory evolved significantly over three years. The founding claim (2022): NLP progress is feeding into powerful technologies with catastrophic downside risk, so empirical researchers should work on alignment via debate, amplification, and scalable oversight. The mechanism: if we can demonstrate that non-expert humans using AI tools can match expert supervision, we have a path to aligning systems that exceed human capabilities.

By 2024-2025, Bowman's theory had evolved: alignment probably cannot be solved before AGI arrives ("this is a humbling and disconcerting state of affairs, but it's one that we shouldn't look away from"). The revised mechanism: build many independent detection methods (bumpers), iterate on failures, and "muddle through" early AGI. Scalable oversight remains the "central pillar" but the Bumpers approach explicitly abandons the goal of highly confident safety arguments in favor of trial-and-error.

The theory requires that: (1) misalignment is detectable before it becomes catastrophic, (2) multiple independent detection methods are collectively sufficient even if individually unreliable, (3) the competitive landscape allows developers time to iterate, and (4) early AGI systems can be used to accelerate alignment research faster than they accelerate capabilities.

Revealed Theory of Change

Bowman's actions track his stated theory remarkably well. The research arc from SNLI to GPQA to AuditBench shows consistent execution on the evaluation/oversight agenda. The scalable oversight paper, debate experiments, and alignment auditing agents are all direct implementations of the founding research agenda.

However, there are two significant divergences between stated and revealed theory:

1. ARG became an Anthropic pipeline, not an independent lab. The founding vision was a university lab doing empirical alignment research. The reality: Bowman moved to Anthropic within months of founding ARG, the research program migrated with him, and almost every alumnus went to frontier labs. ARG's "independence" was always nominal -- the research served Anthropic's needs and was published on Anthropic's blog.

2. The shift from solving alignment to muddling through. The 2022 vision was optimistic about finding alignment techniques that work. The 2025 Bumpers vision explicitly accepts that alignment won't be solved and proposes trial-and-error instead. This is either intellectual honesty (updating on evidence) or goal-post shifting (defining success downward to match reality). The evidence supports honest updating: Bowman published negative results on debate, acknowledged CoT oversight was "discouraging," and discovered alignment faking. These findings genuinely changed his view.

Key Assumptions

A1: Misalignment is detectable before catastrophe. The bumpers approach assumes warning signs of misalignment will appear before a model causes irreversible harm. Evidence for: alignment faking was detectable in current models; probes catch sleeper agents with >99% AUROC. Evidence against: these detections work on manufactured examples, not natural deception. The tool-to-agent gap shows that detection tools effective in isolation fail when composed into real systems. Testable: the overt-saboteur experiment already showed automated detection failing on 2/3 cases.

A2: Diversity of methods compensates for individual weakness. Bowman argues "a half dozen of these methods, each implemented to the standard of a substantial production effort, should be sufficient to detect potentially-catastrophic forms of misalignment." Evidence for: different detection methods catch different issues. Evidence against: if all methods share a common failure mode (e.g., inability to detect misalignment that doesn't manifest in behavior), diversity doesn't help. No formal analysis of correlated failure modes has been published.
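The arithmetic behind A2 can be made concrete with a minimal sketch. All numbers below are hypothetical placeholders, not figures from Bowman or Anthropic: each of six methods is assumed to catch a given misalignment 30% of the time, and `blind_spot` is an assumed probability that the misalignment is of a kind none of the methods can see. The point is only that a shared blind spot caps the combined detection rate regardless of how many methods are stacked.

```python
def p_detect_independent(rates):
    """Probability that at least one method fires, assuming the methods fail independently."""
    p_all_miss = 1.0
    for r in rates:
        p_all_miss *= (1.0 - r)
    return 1.0 - p_all_miss


def p_detect_shared_blind_spot(rates, blind_spot):
    """Same quantity when, with probability `blind_spot`, no method can detect the problem at all.

    `blind_spot` is a hypothetical parameter standing in for a common failure mode,
    e.g. misalignment that never manifests in observable behavior.
    """
    return (1.0 - blind_spot) * p_detect_independent(rates)


# Hypothetical inputs: six methods, each with a 30% detection rate.
rates = [0.3] * 6
independent = p_detect_independent(rates)
correlated = p_detect_shared_blind_spot(rates, blind_spot=0.5)

print(f"independent failures: {independent:.3f}")
print(f"with 50% shared blind spot: {correlated:.3f}")
```

Under the independence assumption, six weak methods combine into a fairly strong one. Under the shared-blind-spot assumption, the combined rate can never exceed `1 - blind_spot`, even if each individual method is made nearly perfect on the cases it can see.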

A3: The competitive landscape allows time to iterate. The Bumpers approach requires developers to "take however much time is needed to make early AGI systems demonstrably safe." Bowman's own Checklist acknowledges this depends on "potentially very demanding constraints around safety" and staying near the frontier. If competitors don't share safety commitments, the iteration time collapses.

A4: Prosaic methods can detect sophisticated deception. This is the MIRI objection. If sufficiently capable systems can reason about their situation and strategically evade all behavioral and interpretability-based tests, the entire Bumpers approach fails. Bowman acknowledges this in the "hitting the capability ceiling" section of Bumpers but argues early AGI systems won't yet be in this regime.

Strengths

Intellectual honesty and self-correction. Bowman published negative results on debate, acknowledged Bumpers is not "a costless choice," admitted the traditional safety case framework "doesn't quite work," and shared alignment faking results that undermine confidence in standard training. This willingness to report uncomfortable findings is rare and valuable.

Evaluation expertise without parallel. The through-line from SNLI to GPQA to AuditBench represents deep, cumulative expertise in building evaluations that reveal model capabilities. Very few researchers in the world have Bowman's combination of evaluation design experience and alignment motivation.

Concrete tools adopted by the field. GPQA is a standard frontier model benchmark. Petri was adopted by UK AISI. The sabotage evaluation framework feeds into Anthropic's RSP. The Anthropic-OpenAI joint evaluation was unprecedented cross-organization cooperation. These are not just papers -- they are infrastructure that other organizations use.

Strong alumni pipeline. ARG alumni are in senior alignment positions at METR (Rein), UK AISI (Pfau, Stickland), OpenAI Alignment (Arnesen), and Meta safety (Michael). The MATS mentorship extends this to ~50 researchers. If these people carry alignment motivation into their organizations, the indirect impact is significant.

Bridge between NLP and alignment communities. Bowman's credibility in mainstream NLP (via SNLI/GLUE) gives his safety work legitimacy with researchers who might dismiss alignment concerns from less established voices. "Eight Things to Know about LLMs" and the underclaiming paper explicitly targeted NLP community attitudes.

Weaknesses and Risks

Complete financial dependency on Anthropic. Bowman's resources, compute, publication venue, and influence all flow through Anthropic. There is no independent funding stream, no independent governance, and no mechanism to pursue research that conflicts with Anthropic's interests. If Anthropic deprioritizes safety or pressures the alignment team, Bowman's only leverage is his NYU appointment -- but he's not actively using it.

The safetywashing problem. Every piece of research Bowman publishes at Anthropic serves dual purposes: advancing alignment knowledge AND supporting Anthropic's public claim that it takes safety seriously. The alignment faking paper, the Bumpers strategy, and the pilot risk report all help Anthropic's reputation and regulatory positioning. Even if the research is genuine, it is inseparable from Anthropic's commercial interests.

Automated auditing is insufficient. Bowman's own experiments show automated systems failing to catch 2 of 3 overt saboteurs and a significant tool-to-agent gap. The Bumpers approach depends on automated detection scaling to match deployment -- but current evidence suggests it doesn't scale.

ARG produced no lasting independent institution. The university lab is dormant. Alumni dispersed to frontier labs. No independent evaluation organization, no ongoing academic program, no permanent infrastructure remains. The "alignment research group" is now Bowman's subteam at Anthropic.

Selection pressure is potentially fatal. Each iteration of the Bumpers loop applies selection pressure against the detection methods. Bowman acknowledges this but argues diverse bumpers can survive "thousands" of iterations. This is an optimistic estimate with no formal analysis. If alignment is sufficiently hard that many iterations are needed, the selection pressure could undermine the entire approach.
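How sensitive the "thousands of iterations" estimate is to its inputs can be illustrated with a toy model. This is a sketch under strong assumptions, not an analysis from the Bumpers post: it assumes each fix-and-retest iteration independently carries a small hypothetical probability `q` of producing a model that evades the bumpers instead of being corrected.

```python
import math


def survival_probability(q, n):
    """Probability that none of n iterations produces an evading model,
    assuming each iteration independently does so with probability q."""
    return (1.0 - q) ** n


def max_iterations(q, target):
    """Largest n for which survival_probability(q, n) stays at or above `target`."""
    return math.floor(math.log(target) / math.log(1.0 - q))


# Hypothetical per-iteration evasion probabilities; no empirical basis is claimed.
for q in (1e-4, 1e-3, 1e-2):
    print(f"q={q}: survive 1000 iterations with p={survival_probability(q, 1000):.4f}, "
          f"iterations allowed at 90% confidence: {max_iterations(q, 0.9)}")
```

In this toy model the number of safe iterations scales roughly as 1/q, so whether "thousands" is plausible depends entirely on a per-iteration evasion rate that nobody has yet measured.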

Cross-References

vs. MIT Algorithmic Alignment Group (AAG): Both are elite university alignment labs, but they represent opposite approaches. AAG is theoretical (assistance games, incomplete contracting), independent from labs, and policy-engaged. ARG is empirical (evaluation, benchmarking), deeply embedded in Anthropic, and focused on near-term technical work. AAG's weakness is the theory-practice gap; ARG's weakness is the independence gap. AAG retained institutional independence but lacks frontier access; ARG has frontier access but lost independence. Bowman's pipeline to frontier labs is stronger than that of AAG lead Dylan Hadfield-Menell (DHM); DHM's policy engagement is stronger than Bowman's. A researcher who values theoretical clarity and institutional independence should look at AAG. A researcher who values empirical impact and frontier access should look at ARG's Anthropic home.

vs. METR: Bowman's most direct connection -- David Rein (ARG alumnus) is at METR, and METR reviewed the pilot sabotage risk report. METR represents the independent evaluation model that ARG could have become but didn't. METR evaluates labs from outside; Bowman evaluates from inside. Both are needed, but the independent perspective is rarer.

vs. FAR AI: Ethan Perez co-founded FAR AI. The FAR AI talk venue and Perez's dual affiliation (Anthropic + FAR AI) show the blurry boundaries between "independent safety research" and "lab-affiliated work."

vs. UK AISI: Two ARG alumni (Pfau, Stickland) now work at UK AISI, which adopted Petri. This pipeline from ARG to a government safety evaluation body is one of ARG's most underappreciated contributions.

What Would Change This Assessment

Evidence that the Bumpers approach catches a sophisticated, naturally arising misalignment (not just manufactured examples) would significantly increase confidence in the theory of change.
Bowman publicly disagreeing with an Anthropic deployment or scaling decision would demonstrate real independence.
ARG reconstituting as an active academic lab (Bowman returning to NYU, hiring students) would address the independence gap.
A formal analysis of correlated failure modes across bumper methods, and of how selection pressure degrades them over repeated iterations, would address the diversity and selection pressure concerns.
Evidence that ARG alumni at frontier labs are measurably shifting lab behavior (e.g., blocking a deployment, changing a policy) would validate the pipeline theory.
Automated auditing catching a subtle, covert saboteur without human assistance would address the scalability concern.

Self-Critique

Weakest claim: My assertion that ARG "produced no lasting independent institution" may be unfair. The MATS pipeline, the open-source tools (Petri, AuditBench), and the alumni in safety positions are a form of institutional legacy -- just not a brick-and-mortar one.

Potential bias: I may be too influenced by the MIRI critique of frontier lab safety work. Bowman's intellectual honesty, willingness to publish negative results, and concrete tools adopted by independent organizations (UK AISI, METR) all suggest his work is genuine safety research, not just safetywashing. The "safetywashing" frame may be unfairly applied to someone who has consistently reported findings that make his employer look worse (alignment faking, sycophancy universality, automated auditing failures).

What a thoughtful critic would say: "You've analyzed ARG as if it were a standalone organization, but it was always Bowman's personal research trajectory with a university label. The 'group' was 12 people for 2 years. The real story is Bowman's individual impact at Anthropic, which is substantial. Judging ARG as an institution misses the point."

Single weakest claim: The suggestion that ARG's alumni pipeline to frontier labs might represent "talent extraction rather than independent safety capacity." Every alumnus I can verify is doing alignment-relevant work. Placing alignment-motivated researchers inside frontier labs may genuinely be the highest-leverage thing a university group can do.

Information that would most change my view: (1) The actual amount of OP funding to ARG -- this would reveal whether it was a real lab or a label. (2) Evidence that Bowman has or has not influenced specific Anthropic deployment decisions -- this would reveal whether his position translates to real safety leverage. (3) A formal analysis showing selection pressure against bumpers is or isn't manageable -- this would validate or invalidate the core theory of change.

Connected to (11)

Anthropic: staff to · Sam Bowman
Anthropic: staff to · Ethan Perez
FAR AI: advisor at · Ethan Perez
MATS: collaborator · Sam Bowman
Meta: staff to · Julian Michael
New York University: spun off from
OpenAI: staff to · Samuel Arnesen
OpenAI: staff to · Jason Phang
UK AISI: staff to · Jacob Pfau
UK AISI: staff to · Asa Cooper Stickland
METR: staff to · David Rein

Sources (42)

Every URL that was read during research.

1. Home (wp.nyu.edu)
2. People (wp.nyu.edu)
3. Why I Think More NLP Researchers Should Engage with AI Safety Concerns (wp.nyu.edu)
4. Sam Bowman (sleepinyourhat.github.io)
5. Old Page (cims.nyu.edu)
6. CS224U: Natural Language Understanding (web.stanford.edu)
7. Eight Things to Know about Large Language Models (arxiv.org)
8. Measuring Progress on Scalable Oversight for Large Language Models (ar5iv.labs.arxiv.org)
9. Findings from a Pilot Anthropic-OpenAI Alignment Evaluation Exercise (alignment.anthropic.com)
10. Building and evaluating alignment auditing agents (alignment.anthropic.com)
11. Jacob Pfau (jacobpfau.com)
12. Anthropic and OpenAI Megastream at MATS: Summer 2026 (matsprogram.org)
13. The Checklist: What Succeeding at AI Safety Will Involve - Sam Bowman (sleepinyourhat.github.io)
14. Blog - Sam Bowman (sleepinyourhat.github.io)
15. Putting up Bumpers - Sam Bowman (sleepinyourhat.github.io)
16. Putting up Bumpers (alignment.anthropic.com)
17. Unknown (hhexiy.github.io)
18. What is the NYU Alignment Research Group's research agenda? (aisafety.info)
19. The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail (arxiv.org)
20. David Rein (idavidrein.com)
21. Ethan Perez (ethanperez.net)
22. GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arxiv.org)
23. Debating with More Persuasive LLMs Leads to More Truthful Answers (arxiv.org)
24. Sabotage evaluations for frontier models (anthropic.com)
25. Anthropic's Pilot Sabotage Risk Report (alignment.anthropic.com)
26. Lessons Learned from the First Misalignment Safety Case | Events at FAR.AI (far.ai)
27. An alignment safety case sketch based on debate (arxiv.org)
28. Tim G. J. Rudner (timrudner.com)
29. Welcome! (janepan9917.github.io)
30. About (jacksonpetty.org)
31. Alignment faking in large language models (anthropic.com)
32. Simple probes can catch sleeper agents (anthropic.com)
33. Publications - Sam Bowman (sleepinyourhat.github.io)
34. Alignment Science Blog (alignment.anthropic.com)
35. Petri: An open-source auditing tool to accelerate AI safety research (anthropic.com)
36. Inverse Scaling: When Bigger Isn't Better (arxiv.org)
37. Five Questions For... Sam Bowman (samstack.io)
38. FAQ - Sam Bowman (sleepinyourhat.github.io)
39. AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors (alignment.anthropic.com)
40. Pre-deployment auditing can catch an overt saboteur (alignment.anthropic.com)
41. Emergent Misalignment (emergent-misalignment.com)
42. Navigating Transformative AI (openphilanthropy.org)