METR (Model Evaluation & Threat Research)

Evals/Testing

ARA evals. Beth Barnes. Gold standard.

Founded: 2022
HQ: Berkeley, CA
Team: ~20-40
Structure: 501(c)(3) nonprofit
Model: Grants

Theory of Change

METR's stated mission: "Develop scientific methods to assess catastrophic risks stemming from AI systems' autonomous capabilities and enable good decision-making about their development." Beth Barnes puts it more plainly: "Have the world not be taken by surprise by dangerous AI stuff happening."

The causal chain: METR creates rigorous, human-calibrated benchmarks that measure how capable AI models are at autonomous action. These measurements inform AI developers, policymakers, and the public about how close models are to dangerous capability thresholds. When combined with binding policies (like the Responsible Scaling Policies METR helped design), these measurements should trigger protective actions before models become dangerous.

METR originated the RSP framework (September 2023), which gates further capability scaling on concrete evaluation results. RSPs have been adopted by 9+ AI developers including Anthropic, OpenAI, and Google DeepMind. This is arguably METR's single largest policy impact.

Barnes on METR's self-assessment of past work: "I wouldn't want to describe any of the things that we've done thus far as actually providing meaningful oversight. There's a bunch of constraints, including the stuff we were doing was under NDA, so we didn't have formal authorisation to alert anyone or say if we thought things were concerning."

What They Do

Core research output. METR's Time Horizons benchmark has become the most widely cited measure of AI autonomous capability progress. Key finding: the length of tasks AI agents can complete (50% success rate) has doubled approximately every 7 months since 2019, accelerating to ~4.3 months for post-2023 models. Current frontier models achieve 50% time horizons of 2-3 hours. This work has been covered by Nature, the New York Times, and MIT Technology Review.
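The doubling arithmetic can be made concrete with a short sketch. It assumes steady exponential growth from a 2-hour starting horizon (the low end of the range quoted above). The 40-hour target and the function names are illustrative choices, not METR's, and nothing here reproduces METR's own analysis code.

```python
import math


def project_horizon(current_hours: float, months_ahead: float,
                    doubling_months: float) -> float:
    """Horizon after `months_ahead`, assuming a constant doubling time."""
    return current_hours * 2 ** (months_ahead / doubling_months)


def months_until(current_hours: float, target_hours: float,
                 doubling_months: float) -> float:
    """Months needed to grow from `current_hours` to `target_hours`."""
    return doubling_months * math.log2(target_hours / current_hours)


# Assumption: start at 2 hours; 40 hours is a hypothetical target
# (one working week), chosen only for illustration.
START_HOURS = 2.0
TARGET_HOURS = 40.0

for doubling in (7.0, 4.3):  # the two doubling times quoted in the text
    in_one_year = project_horizon(START_HOURS, 12, doubling)
    wait = months_until(START_HOURS, TARGET_HOURS, doubling)
    print(f"doubling every {doubling} months: "
          f"{in_one_year:.1f} h after 12 months, "
          f"{wait:.1f} months to reach {TARGET_HOURS:.0f} h")
```

The two doubling times diverge quickly, which is why the choice between the 7-month and 4.3-month figures matters so much for any forecast built on this trend.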

Pre-deployment evaluations. METR has evaluated nearly every major frontier model: GPT-4, GPT-4o, GPT-4.5, o1-preview, o3, o4-mini, GPT-5, GPT-5.1-Codex-Max, Claude 3.5 Sonnet, Claude 3.7, Claude Opus 4.6, DeepSeek-V3/R1, and others. No other independent organization matches this breadth.

Counterintuitive findings. A randomized controlled trial found experienced developers were 19% slower with AI tools while perceiving themselves as 20% faster -- a gap of nearly 40 percentage points between perceived and measured effect. This demonstrates METR's willingness to publish results that complicate mainstream AI narratives.

Evaluation integrity research. METR's GPT-5 evaluation found the model reasoning about being evaluated and correctly identifying METR. The o3 evaluation found "higher propensity to cheat or hack tasks in sophisticated ways" -- 1-2% of task attempts contained reward hacking. The MALT dataset contains 10,919 agent transcripts documenting problematic behaviors, including 103 unprompted reward hacking cases.

Infrastructure and policy. METR builds open-source evaluation tools (Task Standard, HCAST benchmark), provides technical assistance to the European AI Office, and maintains a tracker of frontier safety policies across 12+ companies. The Canary collaboration with RAND ($38M total, ~$17M to METR) aims to build evaluation infrastructure by 2027.

Key People

Beth Barnes -- Founder, Head of Research (transitioned from CEO in April 2024). Former research assistant to Shane Legg at DeepMind, then alignment researcher at OpenAI. Named to the TIME100 AI list in 2024. The most vocal spokesperson for METR's worldview; her 80K Hours interview is the most candid source on the organization. Barnes puts P(things broadly go well) at roughly 50/50 and says "METR is super starved for talent."

Emma Abele -- Executive Director (from April 2024). Handles operations. Also sits on the 80,000 Hours board.

Ajeya Cotra -- Technical staff (threat modeling). Previously led technical AI safety grantmaking at Coefficient Giving (formerly Open Philanthropy). Married to Paul Christiano, who founded ARC (METR's parent org) and now leads safety at US AISI.

Team size is not publicly disclosed. Based on funding scale and named staff, the team is likely 20-40 people, though Barnes has referred to "our 10 researchers or whatever." METR is constrained primarily by talent rather than funding.

Money and Incentives

Total budget/revenue. No 990 filings available (nonprofit ruling date 2024-03-01). Based on known grants, METR's post-Audacious annual budget is likely $8-15M+.

Revenue breakdown. METR is almost entirely grant-funded. Major sources:

Audacious Project (TED): ~$17M (2024) -- largest single grant, part of $38M Canary collaboration with RAND
Open Phil/Coefficient Giving: $1.515M to ARC (2022, pre-METR)
Foundations: Sijbrandij Foundation, Pew Charitable Trusts, Schmidt Sciences, La Centra-Sumerlin Foundation, Astralis Foundation, Expa.org
Pooled funds via Longview Philanthropy, Effektiv Spenden, SFF recommendations
Individual donors: David Farhi, Dylan Field, Geoff Ralston
European AI Office: small technical assistance contract (only earned revenue)

Business model. Pure nonprofit funded by donations. Zvi Mowshowitz assessed funding needs as "None Whatsoever" post-Audacious.

Independence from AI labs. METR explicitly does not accept cash from AI companies. However, METR uses "significant free compute credits" from labs -- the scale and terms are undisclosed. This creates a material dependency: running evaluations requires compute, and labs provide it for free. If a lab revoked access, METR's ability to evaluate that lab would be impaired.

Structural tensions. Evaluations have been conducted under NDAs that prevented METR from alerting anyone about concerns. Labs set the timing, terms, and access level. Barnes says labs want METR to provide a rubber stamp ("run an eval on all of our models") while METR wants meaningful oversight (pre-scaleup evaluations, embedded evaluators, governance mechanisms). METR's report on its GPT-5 evaluation was approved by OpenAI's comms and legal team before publication -- a precedent Barnes says is atypical.

Compensation challenge. METR pays competitive cash compensation matching AI company base salaries, but cannot match equity packages. This is the single biggest barrier to hiring. Barnes: "We could 10x what we were paying people, and then maybe we'd be competitive with lab equity."

What Others Say

Gabe Alfour (ControlAI advisor): "As far as I am aware, no company was ever forced in any way as the result of external Evaluations. I believe there never was a model blocked, postponed or constrained before deployment, let alone during development." He argues evals reverse the burden of proof -- putting it on evaluators to prove danger rather than on labs to prove safety. In his view the theory of change "is broken" without binding regulations, which do not yet exist.

Beth Barnes (responding to this genre of critique): She doesn't directly rebut the "no model was ever blocked" claim. Instead, she argues evals serve an informational function -- moving understanding from inside labs (where it advantages capabilities progress) to the broader public and policymakers. "Security by obscurity about what models can do is not a great safety strategy."

Empiricrafting (methodological critique): METR's human baselines may be inflated (repo maintainers complete tasks 5-18x faster than contract baseliners). Algorithmic scoring via test cases "dramatically overestimates actual capability" (38% test pass rate but 0% mergeable PRs in one analysis).

Jonas Moss (statistical critique): METR's data "can't distinguish between trajectories" -- the uncertainty is too large to differentiate between linear, polynomial, and exponential growth. The specific 7-month doubling time is less certain than presented.

Zvi Mowshowitz (SFF recommender): Rates METR "High Confidence" for quality of work, but notes no funding gap exists post-Audacious.

Sean Goedecke (external reviewer): Praises the developer productivity RCT as "really good" -- the gold-standard methodology and willingness to publish counterintuitive results demonstrate intellectual honesty.

What's Absent

No model has ever been blocked, delayed, or constrained as a result of METR's evaluations. Every model METR has evaluated has been deployed.

No public board of directors or governance structure is visible. For a $17M+ nonprofit conducting critical safety evaluations, this is unusual.

No record of METR's recommendations being followed or overruled by any lab. The entire recommendation-response chain is invisible due to NDAs.

No information about staff departures. No 990 financial filings. No published assessment of whether RSPs have changed lab behavior.

Barnes says METR plans to move into alignment and control evaluations, but as of March 2026 has only published capability evaluations.

Stated Theory of Change

METR's stated theory of change has two pillars:

Pillar 1: Measurement. Create rigorous, human-calibrated benchmarks that produce a meaningful Y-axis for AI capability progress. Specifically, measure how long a task an AI can autonomously complete (the Time Horizons benchmark), and from this derive when models approach dangerous capability thresholds (rogue replication, AI R&D automation, strategic sabotage). If you can measure it, you can see it coming.
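To make "50% time horizon" concrete, here is a minimal sketch of one way such a number can be read off a fitted curve. It assumes success probability follows a logistic curve in log2(task length). The intercept and slope values are invented for illustration, the function names are hypothetical, and this is not presented as METR's actual fitting procedure.

```python
import math


def success_probability(task_minutes: float, intercept: float,
                        slope: float) -> float:
    """Logistic curve in log2(task length); slope > 0 means longer is harder."""
    z = intercept - slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept: float, slope: float,
                 success_rate: float = 0.5) -> float:
    """Task length (minutes) at which the curve crosses `success_rate`."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2.0 ** ((intercept - logit) / slope)


# Hypothetical fitted parameters, chosen only to illustrate the calculation.
INTERCEPT, SLOPE = 4.0, 0.6

h50 = time_horizon(INTERCEPT, SLOPE)        # length with 50% success
h80 = time_horizon(INTERCEPT, SLOPE, 0.8)   # stricter threshold, shorter tasks
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

Under this assumed form, the horizon depends heavily on the chosen success threshold: demanding 80% reliability instead of 50% yields a much shorter task length for the same model.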

Pillar 2: Governance. Use these measurements to inform evaluation-based governance regimes. RSPs (Responsible Scaling Policies) gate further capability scaling on concrete evaluation results. Labs commit in advance to specific protective measures at specific capability levels. External evaluators like METR verify whether thresholds have been crossed.

The causal chain: rigorous measurement -> early warning -> informed policy -> protective action before catastrophe.

Revealed Theory of Change

What METR's actions reveal is subtly but importantly different from the stated theory:

What they actually do. METR produces excellent scientific measurements of AI capability. These measurements are widely cited, influence public discourse, and have shaped how the field thinks about AI progress (the Time Horizons graph is arguably the single most influential visualization in AI safety). METR also produces counterintuitive findings (developer slowdown, reward hacking) that add genuine epistemic value.

What they don't do. METR does not enforce anything. No model has ever been blocked, delayed, or constrained as a result of METR's evaluations. METR operates under lab-controlled NDAs, with lab-controlled access timelines and lab-controlled terms. Barnes admits this openly: "I wouldn't want to describe any of the things that we've done thus far as actually providing meaningful oversight."

The real theory of change is informational, not interventional. METR's actual impact is through shifting the Overton window -- making AI risk legible, quantifiable, and undeniable. The Time Horizons graph tells policymakers and the public "this is happening, this is how fast, here is when you should be scared." The RSP framework gives labs a template for what responsible behavior looks like. But METR has no mechanism to compel compliance.

The gap. The stated theory assumes a world where RSPs have binding force and evaluations trigger protective action. The revealed theory operates in a world where labs voluntarily cooperate (or don't), NDAs prevent public disclosure of concerns, and the word "partner" appears on METR's website describing its relationship with the companies it's supposed to oversee.

Key Assumptions

A1: AI capabilities will continue following predictable scaling trends. If time horizons keep doubling every 4-7 months, METR's benchmarks provide early warning. If capabilities progress in discontinuous jumps (e.g., a sudden architectural breakthrough), METR's extrapolation-based framework would miss the transition.

Evidence for: 6+ years of remarkably consistent doubling in time horizons; R-squared of 0.98.
Evidence against: Jonas Moss argues the data can't distinguish between growth trajectories. The shift from 7-month to 4.3-month doubling suggests the trend itself is unstable.
Testable: Yes. METR publishes projections that can be checked against future models.
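One way to check a claim like this is to regress log2(horizon) on release date and read the doubling time off the slope. The sketch below does that in plain Python. The data points are synthetic, generated from an exact 7-month doubling plus a fixed wobble, and are not METR measurements; fit_doubling_time is a hypothetical helper.

```python
import math


def fit_doubling_time(months, horizons):
    """Least-squares fit of log2(horizon) on time.

    Returns (doubling_time_in_months, r_squared).
    """
    ys = [math.log2(h) for h in horizons]
    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, ys))
    slope = sxy / sxx                      # doublings per month
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(months, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 / slope, 1.0 - ss_res / ss_tot


# SYNTHETIC data (not METR's): 7-month doubling with a fixed wobble
# applied in log2 space to mimic measurement noise.
months = [0, 12, 24, 36, 48, 60, 72]
wobble = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.0]
horizons = [2 ** (m / 7.0 + w) for m, w in zip(months, wobble)]

doubling, r2 = fit_doubling_time(months, horizons)
print(f"estimated doubling time: {doubling:.2f} months, R^2 = {r2:.3f}")
```

A high R-squared here only shows that the points sit close to a straight line on a log scale. It does not by itself rule out other growth curves over the same range, which is relevant to the objection attributed to Jonas Moss above.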

A2: Evaluation results will eventually influence deployment decisions. METR's theory requires that someone -- labs, regulators, or the public -- acts on evaluation results to prevent dangerous deployment.

Evidence for: RSPs have been adopted by 9+ labs. Regulatory interest exists (EU AI Office, UK AISI).
Evidence against: No model has ever been blocked. Labs control access and timing. SB 1047 (which would have required evaluation-based transparency) was vetoed.
Testable: Yes. If a model is blocked or delayed due to evaluation results in the next 2 years, this assumption is validated.

A3: Independent evaluation adds value over internal lab evaluation. Labs also run capability evaluations internally. METR's value depends on independence providing either (a) better evaluation quality, (b) credibility that labs can't provide themselves, or (c) information that labs would suppress.

Evidence for: Barnes says METR's elicitation is "as good or a bit better" than labs. METR publishes findings (like reward hacking in o3) that labs might prefer to downplay.
Evidence against: Barnes also says "companies have generally found that often... our stuff is... not dramatic in either direction" compared to their internal evaluations. If METR's measurements roughly match what labs already know, the value is primarily credibility, not information.
Testable: Partially. If METR ever catches a dangerous capability that internal evaluations missed, that would be strong validation.

A4: METR can maintain independence despite structural dependencies. METR depends on labs for compute credits and model access. Its staff come from and occasionally go to labs. Its evaluations operate under lab-controlled NDAs.

Evidence for: No-cash-from-labs policy. Diverse foundation funding. Barnes' candid public criticism of lab behavior.
Evidence against: OpenAI comms/legal approved the GPT-5 report. NDAs prevented disclosure. Lab access is controlled and sometimes limited (o3 evaluation got "relatively little time"). Alfour argues structural dependence makes real independence impossible.

Strengths

Scientific rigor. METR's evaluation methodology -- human-calibrated baselines, diverse task suites, careful attention to elicitation -- is genuinely best-in-class. The developer productivity RCT demonstrates commitment to empirical methods even when results are unflattering.

Measurement innovation. The Time Horizons framework solved a real problem: creating a meaningful, intuitive Y-axis for AI capability that non-experts can understand. "This AI can do a task that takes a human 2 hours" is far more useful than "this model scores 87% on MMLU."

Policy influence. RSPs are now the de facto governance framework for frontier AI labs. Even if implementation is imperfect, the conceptual contribution -- gate scaling on demonstrated safety -- is significant and durable.

Intellectual honesty. Barnes is remarkably candid in interviews. She volunteers weaknesses ("we haven't provided meaningful oversight"), acknowledges uncertainty ("50/50 on things going well"), and publishes findings that hurt her own narrative (the developer slowdown). This level of honesty is rare in organizations that depend on public credibility.

Field building. Open-source tools (Task Standard, HCAST), data releases (MALT dataset), and engagement with government evaluators (UK AISI, EU AI Office) create genuine public goods.

Weaknesses and Risks

No enforcement mechanism. METR can measure, inform, and recommend. It cannot prevent, delay, or block. This is the core weakness. As Alfour argues, evaluation without enforcement is like a fire alarm with no fire department -- it tells you the building is burning but can't put out the fire.

Structural capture risk. METR's relationship with labs looks more like partnership than oversight. Labs control access, timing, and NDA terms. METR's "About" page proudly lists lab partnerships. The compute dependency creates soft leverage. The revolving-door personnel connections (Barnes from OpenAI/DeepMind; Cotra from Coefficient Giving; UK AISI leaders from OpenAI/DeepMind) are standard for this small field but undermine credibility with outsiders.

Governance opacity. No visible board of directors. No published financials. No transparent process for how METR decides which evaluations to accept, on what terms, or what to publish. For an organization claiming to be an independent auditor, this is a significant gap.

Theory of change depends on others. METR's theory of change requires that (a) regulators pass binding evaluation requirements, (b) labs voluntarily comply with RSP commitments even when commercially costly, or (c) the public applies sufficient pressure based on METR's findings. METR itself does no advocacy, lobbying, or public pressure campaigns. It depends entirely on others to close the loop between information and action.

Narrowness of measurement. Time Horizons measure autonomous software engineering capability. They don't directly measure deception, manipulation, strategic planning, or the kind of misalignment that makes AI dangerous. METR acknowledges this and is beginning alignment/control evaluations, but as of March 2026 these are preliminary.

Cross-References

Apollo Research focuses on scheming and deceptive behavior evaluation -- complementary to METR's capability focus. Together they cover more threat surface than either alone.

UK AISI / US AISI are government evaluators. METR has positioned itself as the research/prototyping arm that develops methods for government evaluators to eventually adopt at scale. With US AISI's future uncertain under the Trump administration, METR may need to absorb some of this function.

Redwood Research is Barnes' top pick for productive independent safety research. She claims Redwood "despite being two people, has really high output compared to the entire rest of the field, and all of the people at labs."

ARC (Alignment Research Center) is METR's former parent org: METR began as ARC Evals and spun out from ARC. Paul Christiano founded ARC, and his intellectual framework heavily influenced METR's approach.

ControlAI / PauseAI represent the advocacy/enforcement side that METR explicitly avoids. Alfour's critique essentially says METR should be more like ControlAI.

What Would Change This Assessment

A model is blocked or delayed due to METR evaluation. This would validate the entire theory of change. Currently unmet.
Binding regulation requiring external evaluation. If a major jurisdiction mandates third-party evaluation with teeth, METR's value increases dramatically.
METR publishes findings against a lab's wishes. If METR breaks an NDA or refuses to accept one, that would demonstrate genuine independence. If METR continues to operate within lab-controlled terms, the capture critique grows stronger.
Lab revokes METR's access. If a lab cuts off METR after an unfavorable finding, that would paradoxically increase METR's credibility (proving they were independent) while reducing their ability to function.
Alignment evaluation breakthroughs. If METR successfully develops evaluations that detect misalignment (not just capabilities), this would address the deepest weakness in their approach.
Discontinuous capability jump. If a model suddenly jumps past METR's thresholds without following the scaling trend, it would reveal the limits of extrapolation-based early warning.

Self-Critique

What's weakest in this analysis:

Financial picture is necessarily speculative. Without 990 data, all budget estimates are guesses from grant amounts.
The "no model has been blocked" critique may be unfair -- METR is young, and the regulatory environment hasn't caught up. But it's been 4 years since METR's first evaluation (GPT-4), which is enough time to see some impact.
This analysis may underweight METR's influence on internal lab behavior in ways that aren't publicly visible (e.g., RSPs may have caused labs to invest more in safety even without blocking models).

What a thoughtful critic would say:

"You're judging METR by an interventionist standard it explicitly rejects. METR says it does science, not advocacy. Judge it as a research organization and it looks excellent."
"The small-world critique (everyone knows everyone) applies to the entire AI safety field. Singling out METR is unfair."

What would most change my view:

Evidence that METR's evaluations privately influenced a specific lab decision (e.g., delayed a release, triggered additional safety work). This would show the NDA-hidden impact is real.
Evidence that the RSP framework has measurably changed lab behavior beyond superficial adoption (e.g., a lab hit a threshold and paused scaling as committed).

Connected to (12)

European AI Office -- collaborator
UK AI Security Institute -- collaborator
Coefficient Giving -- staff from (Ajeya Cotra)
RAND Corporation -- collaborator
US AI Safety Institute -- collaborator
Alignment Research Center -- spun off from (Beth Barnes)
Anthropic -- evaluates
Apart Research -- collaborator
Apollo Research -- collaborator
Google DeepMind -- staff from (Beth Barnes)
OpenAI -- evaluates
OpenAI -- staff from (Beth Barnes)
