Theory of Change
METR's stated mission: "Develop scientific methods to assess catastrophic risks stemming from AI systems' autonomous capabilities and enable good decision-making about their development." Beth Barnes puts it more plainly: "Have the world not be taken by surprise by dangerous AI stuff happening."
The causal chain: METR creates rigorous, human-calibrated benchmarks that measure how capable AI models are at autonomous action. These measurements inform AI developers, policymakers, and the public about how close models are to dangerous capability thresholds. When combined with binding policies (like the Responsible Scaling Policies METR helped design), these measurements should trigger protective actions before models become dangerous.
METR originated the RSP framework (September 2023), which gates further capability scaling on concrete evaluation results. RSPs have been adopted by 9+ AI developers including Anthropic, OpenAI, and Google DeepMind. This is arguably METR's single largest policy impact.
Barnes on METR's self-assessment of past work: "I wouldn't want to describe any of the things that we've done thus far as actually providing meaningful oversight. There's a bunch of constraints, including the stuff we were doing was under NDA, so we didn't have formal authorisation to alert anyone or say if we thought things were concerning."
What They Do
Core research output. METR's Time Horizons benchmark has become the most widely cited measure of AI autonomous capability progress. Key finding: the length of tasks (measured by how long they take skilled humans) that AI agents can complete with a 50% success rate has doubled approximately every 7 months since 2019, with the doubling time shortening to ~4.3 months for post-2023 models. Current frontier models achieve 50% time horizons of 2-3 hours. This work has been covered by Nature, the New York Times, and MIT Technology Review.
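To make the doubling-time figures concrete, here is a minimal sketch of what they imply if the trend simply continues. It assumes a constant doubling time, which is exactly the assumption critics dispute. The 2.5-hour starting point (the midpoint of the 2-3 hour range), the 24-month window, and the 40-hour target are illustrative choices, not METR figures, and the function names are hypothetical.

```python
import math


def project_horizon(current_hours, months_ahead, doubling_months):
    """Extrapolate a 50% time horizon, assuming a constant doubling time."""
    return current_hours * 2 ** (months_ahead / doubling_months)


def months_until(target_hours, current_hours, doubling_months):
    """Months needed to reach target_hours under the same assumption."""
    return doubling_months * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    start_hours = 2.5  # assumption: midpoint of the quoted 2-3 hour range
    for doubling in (7.0, 4.3):  # the two doubling times quoted above
        after_two_years = project_horizon(start_hours, 24, doubling)
        to_forty_hours = months_until(40, start_hours, doubling)
        print(
            f"doubling every {doubling} months: "
            f"{after_two_years:.0f} h after 24 months, "
            f"40 h reached in {to_forty_hours:.0f} months"
        )
```

The gap between the two projections shows how sensitive any forecast is to which doubling time is used, so the headline number deserves scrutiny.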
Pre-deployment evaluations. METR has evaluated nearly every major frontier model: GPT-4, GPT-4o, GPT-4.5, o1-preview, o3, o4-mini, GPT-5, GPT-5.1-Codex-Max, Claude 3.5 Sonnet, Claude 3.7, Claude Opus 4.6, DeepSeek-V3/R1, and others. No other independent organization matches this breadth.
Counterintuitive findings. A randomized controlled trial found experienced developers were 19% slower with AI tools, while perceiving themselves as 20% faster -- a gap of nearly 40 percentage points between perceived and measured effect. This demonstrates METR's willingness to publish results that complicate mainstream AI narratives.
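The arithmetic behind the perception gap can be sketched as follows. This assumes "19% slower" means task time rose by 19% and "20% faster" means developers believed task time fell by 20%; that reading, and the function name, are assumptions made for illustration.

```python
def perception_gap(measured_time_change, perceived_time_change):
    """Compare measured and perceived fractional changes in task time.

    Positive values mean tasks took longer; negative values mean shorter.
    Returns the gap in percentage points and the ratio of actual task time
    to the task time developers believed they were achieving.
    """
    gap_points = (measured_time_change - perceived_time_change) * 100
    ratio = (1 + measured_time_change) / (1 + perceived_time_change)
    return gap_points, ratio


# Assumed reading of the RCT figures: +19% measured, -20% perceived.
gap, ratio = perception_gap(0.19, -0.20)
print(f"gap: {gap:.0f} percentage points; actual/perceived time: {ratio:.2f}")
```

The gap is 39 percentage points, not 39 percent of anything. Under the same reading, actual task time was close to one and a half times what developers believed it to be.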
Evaluation integrity research. METR's GPT-5 evaluation found the model reasoning about being evaluated and correctly identifying METR as the evaluator. The o3 evaluation found "higher propensity to cheat or hack tasks in sophisticated ways" -- 1-2% of task attempts contained reward hacking. METR's MALT dataset contains 10,919 agent transcripts documenting problematic behaviors, including 103 unprompted reward hacking cases.
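As a rough sanity check on scale, the MALT counts can be converted into a rate with an uncertainty interval. The sketch below uses a standard Wilson score interval and treats each transcript as an independent trial, which is an assumption: the transcripts were not necessarily sampled that way, so the interval is illustrative only.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    )
    return center - half_width, center + half_width


# Counts quoted above: 103 unprompted reward hacking cases in 10,919 transcripts.
rate = 103 / 10919
low, high = wilson_interval(103, 10919)
print(f"rate {rate:.2%}, approximate 95% interval {low:.2%} to {high:.2%}")
```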
Infrastructure and policy. METR builds open-source evaluation tools (Task Standard, HCAST benchmark), provides technical assistance to the European AI Office, and maintains a tracker of frontier safety policies across 12+ companies. The Canary collaboration with RAND ($38M total, ~$17M to METR) aims to build evaluation infrastructure by 2027.
Key People
Beth Barnes -- Founder, Head of Research (transitioned from CEO in April 2024). Former research assistant to Shane Legg at DeepMind, then alignment researcher at OpenAI. Named to the TIME100 AI list in 2024. The most vocal spokesperson for METR's worldview; her 80,000 Hours interview is the most candid source on the organization. Barnes puts the probability that things broadly go well at roughly 50% and says "METR is super starved for talent."
Emma Abele -- Executive Director (from April 2024). Handles operations. Also on 80,000 Hours board.
Ajeya Cotra -- Technical staff (threat modeling). Previously led technical AI safety grantmaking at Coefficient Giving (formerly Open Philanthropy). Married to Paul Christiano, who founded ARC (the organization METR spun out of) and now leads safety at US AISI.
Team size is not publicly disclosed. Funding scale and named staff suggest roughly 20-40 people, though Barnes has referred offhand to "our 10 researchers or whatever," so the research team may be only part of that headcount. METR is constrained primarily by talent rather than funding.
Money and Incentives
Total budget/revenue. No Form 990 filings are available (METR's nonprofit ruling date is 2024-03-01). Based on known grants, METR's post-Audacious annual budget is likely $8-15M+.
Revenue breakdown. METR is almost entirely grant-funded. Major sources:
- Audacious Project (TED): ~$17M (2024) -- largest single grant, part of $38M Canary collaboration with RAND
- Coefficient Giving (formerly Open Philanthropy): $1.515M to ARC (2022, pre-METR)
- Foundations: Sijbrandij Foundation, Pew Charitable Trusts, Schmidt Sciences, La Centra-Sumerlin Foundation, Astralis Foundation, Expa.org
- Pooled funds via Longview Philanthropy, Effektiv Spenden, SFF recommendations
- Individual donors: David Farhi, Dylan Field, Geoff Ralston
- European AI Office: small technical assistance contract (only earned revenue)
Business model. Pure nonprofit funded by donations. Zvi Mowshowitz assessed funding needs as "None Whatsoever" post-Audacious.
Independence from AI labs. METR explicitly does not accept cash from AI companies. However, METR uses "significant free compute credits" from labs -- the scale and terms are undisclosed. This creates a material dependency: running evaluations requires compute, and labs provide it for free. If a lab revoked access, METR's ability to evaluate that lab would be impaired.
Structural tensions. Evaluations have been conducted under NDAs that prevented METR from alerting anyone about concerns. Labs set the timing, terms, and access level. Barnes says labs want METR to provide a rubber stamp ("run an eval on all of our models") while METR wants meaningful oversight (pre-scaleup evaluations, embedded evaluators, governance mechanisms). The OpenAI GPT-5 evaluation was approved by OpenAI's comms and legal team -- an arrangement Barnes says is atypical.
Compensation challenge. METR pays competitive cash compensation matching AI company base salaries, but cannot match equity packages. This is the single biggest barrier to hiring. Barnes: "We could 10x what we were paying people, and then maybe we'd be competitive with lab equity."
What Others Say
Alfour (ControlAI advisor): "As far as I am aware, no company was ever forced in any way as the result of external Evaluations. I believe there never was a model blocked, postponed or constrained before deployment, let alone during development." He argues evals reverse the burden of proof -- putting it on evaluators to prove danger rather than on labs to prove safety. In his view the theory of change "is broken" because the binding regulations it depends on do not exist.
Beth Barnes (responding to this genre of critique): She doesn't directly rebut the "no model was ever blocked" claim. Instead, she argues evals serve an informational function -- moving understanding from inside labs (where it advantages capabilities progress) to the broader public and policymakers. "Security by obscurity about what models can do is not a great safety strategy."
Empiricrafting (methodological critique): METR's human baselines may be inflated (repo maintainers complete tasks 5-18x faster than contract baseliners). Algorithmic scoring via test cases "dramatically overestimates actual capability" (38% test pass rate but 0% mergeable PRs in one analysis).
Jonas Moss (statistical critique): METR's data "can't distinguish between trajectories" -- the uncertainty is too large to differentiate between linear, polynomial, and exponential growth. The specific 7-month doubling time is less certain than presented.
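A toy illustration of the statistical point, not a reproduction of Moss's analysis: with only a handful of noisy points, an exponential curve and a power-law (polynomial-like) curve can both fit the observed data well while disagreeing sharply once extrapolated. The data points below are hypothetical, chosen only for illustration, and the helper name is likewise an assumption.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)


# Hypothetical series (NOT METR data): years since start, horizon in minutes.
years = [1, 2, 3, 4, 5, 6]
horizon_min = [1, 2, 5, 8, 20, 30]
log_h = [math.log(h) for h in horizon_min]

# Exponential growth is a straight line in (t, log h).
a_exp, b_exp, r2_exp = linear_fit(years, log_h)
# Power-law growth is a straight line in (log t, log h).
a_pow, b_pow, r2_pow = linear_fit([math.log(t) for t in years], log_h)

# Extrapolate both fitted curves to a later year.
t_future = 10
pred_exp = math.exp(a_exp + b_exp * t_future)
pred_pow = math.exp(a_pow + b_pow * math.log(t_future))

print(f"exponential: R^2 = {r2_exp:.3f}, year-{t_future} forecast = {pred_exp:.0f} min")
print(f"power law:   R^2 = {r2_pow:.3f}, year-{t_future} forecast = {pred_pow:.0f} min")
```

Both fits score well in-sample on this toy series, yet their forecasts differ several-fold. A good in-sample fit for one growth law therefore does little to rule out the others.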
Zvi Mowshowitz (SFF recommender): Rates METR "High Confidence" for quality of work, but notes no funding gap exists post-Audacious.
Sean Goedecke (external reviewer): Praises the developer productivity RCT as "really good" -- the gold-standard methodology and the willingness to publish counterintuitive results demonstrate intellectual honesty.
What's Absent
No model has ever been blocked, delayed, or constrained as a result of METR's evaluations. Every model METR has evaluated has been deployed.
No public board of directors or governance structure is visible. For a $17M+ nonprofit conducting critical safety evaluations, this is unusual.
No record of METR's recommendations being followed or overruled by any lab. The entire recommendation-response chain is invisible due to NDAs.
No information about staff departures. No 990 financial filings. No published assessment of whether RSPs have changed lab behavior.
Barnes says METR plans to move into alignment and control evaluations, but as of March 2026 has only published capability evaluations.
Recommended Reading
Beth Barnes on 80,000 Hours Podcast (#217) (https://80000hours.org/podcast/episodes/beth-barnes-ai-safety-evals/) -- The most candid and unfiltered source. Barnes explicitly admits METR hasn't provided "meaningful oversight," puts P(doom) around 50%, argues most alignment researchers should leave labs, and draws extended parallels to nuclear weapons governance failures. Start here.
"Why AI Evaluation Regimes are bad" (Alfour) (https://cognition.cafe/p/why-ai-evaluation-regimes-are-bad) -- The strongest counterargument to METR's entire approach. Argues evals without binding regulation are actively harmful because they shift burden of proof to the public.
GPT-5 Evaluation Report (https://evaluations.metr.org/gpt-5-report/) -- METR's most detailed evaluation, showing evaluation awareness, sandbagging concerns, and the NDA constraints on independence.
AXRP Episode 34 (https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html) -- Earlier, more technical interview with Barnes covering what evals can and cannot prove, the gap between measuring capability and measuring danger, and how METR interfaces with labs.
"Are We There Yet?" (Empiricrafting) (https://empiricrafting.substack.com/p/are-we-there-yet-evaluating-metrs) -- Detailed methodological critique of the Time Horizons benchmark. Points to inflated baselines and questionable scoring. Essential reading to calibrate how much confidence to place in the headline doubling-time number.