
EquiStamp

Evals/Testing

Commercial safety evals.

Founded: 2023
HQ: Austin, TX
Team: 8
Structure: PBC
Model: Product Revenue

Theory of Change

EquiStamp's theory of change is that AI safety research organizations need reliable operational labor to build, run, and maintain evaluations -- and that this labor must be independent of the labs being evaluated. From their website: "We're a labor pool for alignment research. Labs and projects come to us to implement evals, run baselines, annotate data, and handle administrative overhead. The idea is simple: researchers focus on research, and we handle the grunt work."

CEO Chris Canal frames this around a principal-agent problem: AI companies cannot be trusted to grade their own safety because they have financial incentives to pass. On this view, third-party evaluation infrastructure, run by people motivated by safety rather than profit, is a structural necessity. EquiStamp positions itself as that infrastructure -- not the researchers designing the evals, but the engineers implementing them.

The causal chain: EquiStamp provides skilled, flexible labor to evaluation organizations (primarily METR) -> those organizations produce higher-quality, more rigorous evaluations -> those evaluations are used in pre-deployment safety testing of frontier AI models -> this creates accountability and slows unsafe deployment.

What They Do

EquiStamp is a small (~8-person), young (~2.5-year-old) contract services company that builds and operates AI safety evaluations for research organizations. Their concrete output:

METR work (primary client): Co-authored the HCAST benchmark (189 tasks, 563 human baselines, used for pre-deployment evaluation of GPT-4.5, Claude 3.5 Sonnet, DeepSeek V3). Contributed human baselines to RE-Bench (one of the task suites underpinning METR's influential "AI capability doubling every 7 months" result). Canal and O'Connell are credited authors on the HCAST paper; four other team members are acknowledged for baseline work.

UK AISI work: Built ControlArena, an open-source library for AI control experiments, in collaboration with the UK AI Security Institute and Redwood Research. EquiStamp team members are named in the library's citation.

EU AI Act work: Subcontractor for "evaluation engineering" in the FAR.AI-led consortium selected for CBRN risk evaluation (Lot 1) and manipulation risk evaluation (Lot 4) under the EU AI Act.

Other clients: Built CVE Bench for Daniel Kang's lab (UIUC). Built "Linux Bench" for Redwood Research (no public documentation found).

Evaluation platform: Operates equistamp.com with API, model scoring, custom evaluations, and anti-gaming features (paraphrased evaluations). No publicly visible adoption. Published a sophisticated evaluation design methodology as a Claude Skill with a 7-dimension rubric.

The labor model is task-based: a core group of regulars plus a broader pool of specialists pick up tasks as available, on flexible schedules. This is essentially a managed contractor pool, not a traditional employer.

Key People

Chris Canal (CEO, Co-Founder): BS Computer Engineering, Northeastern. Former SDE at Amazon Last Mile; founded Wizio (real estate VR, acquired 2019). Based in Austin, TX. Most public-facing -- two podcast appearances, policy writing, Substack posts. His pre-EquiStamp writing reveals genuine engagement with AI safety (the "Omega Protocol" proposed a $5.7B Manhattan Project for alignment; a co-authored policy article analyzed nuclear and CFC policy as models for AI governance). Background is applied engineering, not academic safety research.

Daniel O'Connell (CTO, Co-Founder): UK citizen. Co-author on HCAST, contributor on ControlArena. Minimal public presence. His non-US citizenship blocks the company from SBA loans.

Rob Miles (AI Advisor): PhD, prominent AI safety YouTuber, creator of StampyAI/aisafety.info. EquiStamp grew out of his Discord server. Exact nature of advisory role (equity, board seat, informal?) is not publicly documented.

Team of ~6 additional contributors, several with prior METR experience. Distributed globally (US, UK, Canada, New Zealand).

Money and Incentives

Revenue: $192K accumulated operating capital from client revenue (self-reported, as of Manifund filing). No independent verification exists.

Business model: Pure contract revenue from AI safety organizations. Clients include METR (primary), UK AISI, Redwood Research, Daniel Kang's lab (UIUC), EU AI Office (via FAR.AI consortium). No visible product revenue from the evaluation platform.

External funding: Sought $400K via Manifund through a SAFE agreement with an $8M valuation cap, channeled through AISTOF (AI Safety Tactical Opportunities Fund, managed by JueYan Zhang, $30M+ deployed). This is an investment (with equity conversion rights), not a grant. No grants from Coefficient Giving or any other major AI safety funder appear in public records.
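To make the "investment, not a grant" point concrete, here is a minimal arithmetic sketch of what $400K at an $8M cap would imply if the SAFE converted exactly at the cap. The source does not say whether the cap is pre-money or post-money, so both readings are shown; the function name and the simplifications (no discount, no other convertible instruments, no option pool) are assumptions for illustration only.

```python
def safe_ownership(investment: float, valuation_cap: float, post_money: bool = True) -> float:
    """Approximate ownership fraction if a SAFE converts exactly at its cap.

    Simplified: ignores discounts, option pools, and other convertibles.
    """
    if post_money:
        # Post-money cap: the cap already includes the SAFE money.
        return investment / valuation_cap
    # Pre-money cap: the SAFE money is added on top of the cap.
    return investment / (valuation_cap + investment)


# Figures from the text: $400K sought, $8M valuation cap.
post = safe_ownership(400_000, 8_000_000, post_money=True)
pre = safe_ownership(400_000, 8_000_000, post_money=False)

print(f"post-money cap reading: {post:.2%}")
print(f"pre-money cap reading:  {pre:.2%}")
```

Under either reading the investor would hold roughly 5% on conversion at the cap, which is why commenters treated the arrangement as equity financing rather than philanthropy.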

Cash flow constraint: Engineers are paid weekly, while clients pay on terms ranging from Net 30 to Net 180. The EU Commission pays every 6 months. This mismatch is the company's primary financial challenge and the reason for the Manifund fundraise. SBA loans were denied because the CTO is a UK citizen.
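The size of that mismatch can be sketched with simple arithmetic: payroll goes out every week, but the first payment for the work arrives only after the client's payment term. The weekly payroll figure below is hypothetical (the source does not disclose it), and the model ignores invoicing lag and overlapping contracts.

```python
def bridge_capital_needed(weekly_payroll: float, payment_delay_days: int) -> float:
    """Payroll paid out before the first client payment for that work arrives."""
    weeks_outstanding = payment_delay_days / 7
    return weekly_payroll * weeks_outstanding


# Hypothetical weekly payroll; the real figure is not public.
WEEKLY_PAYROLL = 10_000

# Net 30 and Net 180 come from the text; Net 90 is an illustrative midpoint.
gaps = {days: bridge_capital_needed(WEEKLY_PAYROLL, days) for days in (30, 90, 180)}

for days, gap in gaps.items():
    print(f"Net {days}: about ${gap:,.0f} of payroll fronted before payment")
```

The gap scales linearly with the payment term, so a Net 180 client ties up six times the working capital of a Net 30 client for the same weekly payroll.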

Client concentration: METR appears to be the dominant client. Total client count is approximately 5 organizations. Loss of the METR relationship would likely be existential.

Incentive analysis: EquiStamp's financial dependence on a small number of safety organizations creates the same structural conflict it claims to solve. Canal argues third-party evals combat the principal-agent problem between labs and evaluators -- but EquiStamp itself faces a version of this problem: it is paid by the orgs whose evals it implements. If EquiStamp discovered fundamental flaws in METR's methodology, there would be financial pressure not to publicize this. The PBC structure provides legal cover to prioritize safety over profit, but enforcement is self-policed.

Scale context: $192K total capital for ~2.5 years of operation, with ~8 contributors, suggests either very small contract values, very part-time contributors, or both. For comparison, METR had $10M+ revenue in 2023.
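The scale argument rests on a back-of-the-envelope division, sketched here. It assumes the $192K is spread evenly over the period and across all contributors, which is a simplification: if the figure is a retained balance after expenses rather than total revenue, actual revenue would be higher.

```python
# Figures from the text (all approximate and self-reported).
ACCUMULATED_CAPITAL = 192_000  # dollars
YEARS_OPERATING = 2.5
CONTRIBUTORS = 8

per_year = ACCUMULATED_CAPITAL / YEARS_OPERATING
per_contributor_year = per_year / CONTRIBUTORS

print(f"Per year: ${per_year:,.0f}")
print(f"Per contributor per year: ${per_contributor_year:,.0f}")
```

Even as a rough lower bound, a four-figure annual amount per contributor is consistent with the reading that most contributors are part-time.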

What Others Say

Structural validation (Stanford HAI, 2024): A workshop on third-party AI evaluations identified exactly the problems EquiStamp tries to solve -- financial dependence of evaluators, lack of legal protections, no standardization. "Companies write their own tests and they grade themselves."

Evals ecosystem critique (AI Lab Watch, 2024): Labs have not meaningfully granted external evaluators pre-deployment access since GPT-4/Claude 2 in early 2023. If labs are not giving evaluators real access, the evaluation infrastructure EquiStamp builds may not serve as actual safety gates.

Evals readiness critique (TIME, 2024): METR's own policy director acknowledges "the evals are not ready" for the role they are expected to play. Connor Leahy of Conjecture warns that evaluation work could be complicit in "safetywashing."

METR methodology critique (Empiricrafting, 2025): Technical criticism of METR's flagship "doubling every 7 months" evaluation methodology. The benchmark consists almost entirely of software engineering tasks (a fact not disclosed in the abstract), human baselines have a 61% success rate, which raises questions about reliability, and the "messiness" of tasks is low compared to real work. Since EquiStamp's labor feeds into this methodology, the critique applies indirectly.

Funding structure critique (Manifund): Commenters criticized the SAFE arrangement as a "business loan disguised as grant," questioning whether a for-profit investment belongs on a charitable funding platform.

Defense (Beth Barnes, METR founder): Positions evaluation work as "a pragmatic step that improves things in the absence of a moratorium." Acknowledges evals are not the ideal solution but argues it is better for companies to publish flawed safety plans than none at all.

What's Absent

  • Zero presence on LessWrong, Alignment Forum, or EA Forum. The org is invisible to the AI safety community's primary discussion platforms.
  • No public criticism of EquiStamp's actual work quality. No independent assessment of whether their evaluation engineering is good or bad.
  • No public financials beyond self-reported $192K. No 990 filings (for-profit). No independent financial verification.
  • No documented team departures or public testimonials about working at EquiStamp.
  • No comparison with alternative evaluation contractors or analysis of EquiStamp's competitive position.
  • Evaluation platform (equistamp.com) shows no visible external adoption or usage.
  • Full podcast transcripts unavailable, leaving the most candid sources on Canal's thinking only partially captured.

Recommended Reading

  1. Crazy Wisdom Podcast #425 (44 min, Jan 2025) -- Canal discusses evals, principal-agent dilemmas, and ethical boundaries. Most unfiltered view of how EquiStamp thinks. https://podcasts.apple.com/gb/podcast/episode-425-agents-evals-and-the-future-of-ai/id1354589767?i=1000683477445

  2. Empiricrafting: "Are We There Yet?" (Dec 2025) -- Detailed technical critique of the METR evaluation methodology that EquiStamp helps build. The strongest available counterargument to EquiStamp's ecosystem. https://empiricrafting.substack.com/p/are-we-there-yet-evaluating-metrs

  3. HCAST paper (arxiv 2503.17354) -- EquiStamp's primary work product. Read the author contributions (Appendix A) and acknowledgments (Appendix B) to understand their actual contribution. https://arxiv.org/abs/2503.17354

  4. Stanford HAI: Third Party Evaluations (Nov 2024) -- Structural analysis of the evaluation ecosystem EquiStamp operates in, including funding independence challenges. https://hai.stanford.edu/news/strengthening-ai-accountability-through-better-third-party-evaluations

  5. TIME: "Nobody Knows How to Safety-Test AI" (Mar 2024) -- Context on how even evaluation proponents acknowledge the methodology is immature. https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/

Claude's Analysis
An opinionated read. Read the brief first to form your own view.

Stated Theory of Change

EquiStamp claims that AI safety evaluations suffer from a structural principal-agent problem: the companies building dangerous AI systems are grading their own safety. Third-party evaluation is the solution, but evaluation requires operational labor -- building tasks, running human baselines, implementing scoring infrastructure, managing QA. Safety researchers should focus on research design; EquiStamp handles the engineering.

The mechanism: EquiStamp provides a pool of safety-motivated engineers who can be rapidly deployed to evaluation projects. By being a Public Benefit Corporation independent of any lab, they avoid the conflicts of interest that would arise if labs did this work in-house. The result: higher-quality, more rigorous evaluations that create real accountability for frontier AI development.

Revealed Theory of Change

EquiStamp's actions are closely aligned with their stated theory. They actually do the work they claim to do -- HCAST authorship, ControlArena engineering, RE-Bench baselines, EU consortium participation. There is no significant gap between stated and revealed theory.

However, the revealed theory has a subtle but important narrowing. EquiStamp's stated theory positions them as serving the broad evaluation ecosystem. In practice, they are primarily a METR contractor. The relationship is close enough that Canal and O'Connell are credited authors on METR's flagship paper. This is less "independent third-party evaluation infrastructure" and more "METR's preferred outsourced engineering team." This is not bad -- METR is doing important work, and helping METR do it better is genuinely valuable. But it is a narrower theory of change than the stated one suggests.

The evaluation platform (equistamp.com) hints at a broader ambition -- a product play where anyone can manage models, run evaluations, and compare scores. But there is no evidence this product has found traction. The revenue is entirely from bespoke contracts, not platform usage.

Canal's pre-EquiStamp writing (the Omega Protocol, the policy article) reveals someone who thinks big and cares genuinely about AI safety. EquiStamp is the pragmatic implementation of that care -- not the grand plan, but the "what can I actually build right now to help" version. This is commendable.

Key Assumptions

1. Evaluations are a meaningful safety mechanism.

  • Evidence for: Evaluations are embedded in Anthropic's RSP, OpenAI's Preparedness Framework, and the EU AI Act. METR's work has influenced major policy decisions. Evaluations at least create some accountability.
  • Evidence against: METR's own policy director says "the evals are not ready." AI Lab Watch documents that labs stopped giving external evaluators meaningful access after 2023. Holly Elmore and others argue evals are "just asking the model stuff" and cannot detect the most dangerous capabilities (deception, power-seeking). The evaluation paradigm may provide false assurance that enables more aggressive deployment.
  • Testable? Partially -- if evaluations fail to detect a dangerous capability that is later discovered in deployment, that would be strong evidence against.
  • If wrong: EquiStamp's work is still useful (human baselines and benchmark infrastructure have academic/scientific value) but its safety impact is near zero.

2. There is sustained demand for evaluation labor.

  • Evidence for: EU AI Act creates regulatory demand for CBRN evaluations. UK AISI is scaling up. METR's evaluation workload is growing as more models need pre-deployment assessment. The "doubling every 7 months" result has made evaluation work politically important.
  • Evidence against: Labs could bring evaluation work in-house (and have economic incentives to do so). Government AI safety institutes could build their own teams. The evaluation market is young and could contract if regulatory enthusiasm fades.
  • Testable? Yes -- whether evaluation spending grows or shrinks in the next 2-3 years will be observable.
  • If wrong: EquiStamp loses its revenue base and cannot sustain itself.

3. A task-based flexible labor pool is the right organizational model.

  • Evidence for: Safety researchers often need burst capacity around publication deadlines and pre-deployment evaluation windows. A flexible pool can scale up and down. The model attracts people who want to contribute part-time while having other commitments.
  • Evidence against: Evaluation quality may suffer with transient contributors who lack deep context on any single project. Deep institutional knowledge (crucial for detecting subtle benchmark problems) is hard to build with a contractor pool. The task-based model may produce commodity labor rather than expert evaluation engineering.
  • Testable? Indirectly -- through the quality of EquiStamp's output as assessed by clients and independent reviewers.
  • If wrong: EquiStamp's work is lower quality than it could be with a smaller, more dedicated permanent team.

4. For-profit PBC is a viable structure for this work.

  • Evidence for: PBC status creates legal obligation to consider safety alongside profit. For-profit structure allows equity fundraising (SAFE via AISTOF) and contract revenue, avoiding dependence on grant cycles. The structure is honest about the commercial nature of the work.
  • Evidence against: For-profit status excludes EquiStamp from most philanthropic AI safety funding (no Coefficient Giving grants). PBC enforcement is self-policed. Equity investors may eventually push for profit maximization that conflicts with safety priorities. The "business loan disguised as grant" criticism reflects community discomfort with for-profit entities in the AI safety funding ecosystem.
  • Testable? Over time -- whether equity investors influence EquiStamp's client selection or work priorities.
  • If wrong: EquiStamp may face pressure to take non-safety work to achieve profitability, diluting their mission.

Strengths

Concrete, verifiable output. EquiStamp has co-authored a major evaluation paper, contributed to UK government safety infrastructure, and been selected for EU regulatory work. For a 2.5-year-old, ~8-person company, this is a strong track record of shipping real work.

Niche fit. The "grunt work" positioning is genuinely useful. METR researchers designing evaluations should not also be spending their time on task QA, human baselining logistics, and infrastructure maintenance. EquiStamp provides specialization.

Methodological sophistication. The 7-dimension evaluation rubric (published as a Claude Skill) demonstrates that EquiStamp thinks carefully about what makes evaluations good. This meta-level thinking about evaluation quality is more intellectually valuable than merely executing evaluations.

Genuine safety motivation. Canal's writing shows real engagement with AI safety ideas. Declining parasocial AI and scam-related projects shows willingness to sacrifice revenue for values. The PBC structure formalizes this commitment.

International reach. Work with METR (US), UK AISI (UK), and the EU AI Office shows EquiStamp can operate across the three major AI governance jurisdictions.

Weaknesses and Risks

Client concentration. METR appears to be the dominant revenue source. If METR hired in-house engineers, changed contractors, or reduced its evaluation workload, EquiStamp could face an existential financial crisis.

Financial precarity. $192K accumulated capital after 2.5 years of operation, with cash flow mismatches requiring bridge financing, is extremely fragile. One delayed payment or lost contract could be fatal.

Minimal governance. No disclosed board of directors, no advisory board, no governance structure beyond two co-founders. For a company that aspires to be "independent evaluation infrastructure," the lack of governance is a weakness.

Invisible to the community. Zero forum posts, no community engagement, no blog posts discussing methodology. This means no external peer review of their work and no community accountability. EquiStamp is building safety infrastructure without participating in the safety community's discourse.

The evals-as-safety-gates assumption. EquiStamp's value depends entirely on evaluations being a meaningful safety mechanism. If labs continue to not grant external evaluators real pre-deployment access (as documented by AI Lab Watch), and if evaluations remain academic benchmarks rather than safety gates, then EquiStamp is building infrastructure for a system that doesn't actually protect anyone.

Labor pool quality control. The task-based contractor model raises questions about depth of expertise. Building truly excellent evaluation infrastructure requires deep understanding of benchmark design, statistical methodology, and adversarial testing -- not just engineering execution. It is unclear whether EquiStamp's model produces the depth of expertise needed.

Cross-References

METR: EquiStamp's primary client and the org whose work gives EquiStamp legitimacy. The relationship is deeply symbiotic -- METR needs operational labor, EquiStamp needs contracts. EquiStamp should be evaluated partly through the lens of METR's theory of change.

Haize Labs: Another young eval-focused company, but more oriented toward red-teaming and adversarial testing of deployed systems. Haize and EquiStamp occupy adjacent but non-overlapping niches -- Haize focuses on finding vulnerabilities in deployed models, EquiStamp on building and running structured evaluation benchmarks.

Apollo Research: Focused on AI deception and scheming evaluations. Operates at a more research-oriented level than EquiStamp, designing evaluations rather than providing the engineering labor to implement them. Apollo is a potential EquiStamp client, not a competitor.

FAR.AI: Selected EquiStamp as a subcontractor for EU AI Act work. FAR.AI is a larger research nonprofit that leads the consortium; EquiStamp provides specialized evaluation engineering.

UK AISI: Government agency for which EquiStamp built ControlArena. Represents the institutional demand side -- governments need evaluation infrastructure and are willing to contract with small safety-focused companies to build it.

What Would Change This Assessment

Upward: (1) EquiStamp publishes independent research or analysis (not just as METR subcontractors). (2) Labs begin granting genuine pre-deployment access to third-party evaluators and EquiStamp plays a role. (3) The evaluation platform gains real external users. (4) EquiStamp diversifies its client base significantly (>5 clients, no single client >40% of revenue). (5) Canal or team members publish substantive forum posts engaging with community critiques of evaluation methodology.
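The diversification criterion in item (4) can be written as a small check. This is a sketch under stated assumptions: the function name is invented, the client labels and revenue figures are placeholders (no per-client revenue is public), and ">5 clients" is read as at least six.

```python
def is_diversified(revenue_by_client: dict[str, float],
                   min_clients: int = 6,
                   max_share: float = 0.40) -> bool:
    """True if there are more than 5 clients and none exceeds 40% of revenue."""
    total = sum(revenue_by_client.values())
    if total <= 0:
        return False
    largest_share = max(revenue_by_client.values()) / total
    return len(revenue_by_client) >= min_clients and largest_share <= max_share


# Placeholder data only: five clients with one dominant payer.
concentrated = {"client_a": 70, "client_b": 10, "client_c": 10,
                "client_d": 5, "client_e": 5}

# Placeholder data only: six clients, largest at 35% of revenue.
spread = {"client_a": 35, "client_b": 20, "client_c": 15,
          "client_d": 10, "client_e": 10, "client_f": 10}

print(is_diversified(concentrated))
print(is_diversified(spread))
```

Both conditions must hold at once: adding a sixth small client would not satisfy the criterion while one client still supplies most of the revenue.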

Downward: (1) METR brings evaluation engineering in-house or switches to another contractor. (2) EquiStamp fails to secure follow-on financing and becomes unable to bridge cash flow gaps. (3) A major evaluation failure reveals quality problems in EquiStamp's work product. (4) The "evals as safety gates" paradigm collapses as critics argue persuasively that evaluations provide false assurance. (5) EquiStamp takes on non-safety clients to survive financially.

Self-Critique

Weakest claim: My assessment that EquiStamp's work is "genuinely useful" rests on accepting the premise that evaluation infrastructure matters for safety. If the "safetywashing" critique is correct -- that evaluations provide false assurance enabling more aggressive deployment -- then EquiStamp's work could be net negative. I have not given this possibility sufficient weight because the evidence base makes it hard to adjudicate.

Missing sources: The two podcast transcripts (3.5 hours of muckrAIkers, 44 minutes of Crazy Wisdom) are likely the most informative sources about Canal's thinking and EquiStamp's strategy, but only show notes were available. The Manifund project page, which contains the most detailed financial and strategy information, could not be fully fetched due to rate limiting. Both gaps significantly limit the analysis.

Potential bias: I may be giving EquiStamp too much credit for being small and scrappy. The "earnest startup doing good work on a shoestring" narrative is appealing but might obscure the question of whether the work actually matters at the scale it operates. A company with $192K in total capital and ~8 part-time contributors is not going to meaningfully shape the AI safety evaluation ecosystem regardless of how well-intentioned it is.

What a smart critic would say: "EquiStamp is basically a staffing agency for METR. It has no independent research agenda, no original contributions to evaluation methodology beyond a Claude Skill, and its existence is entirely dependent on whether a handful of safety orgs continue to outsource engineering work. If METR hires two more engineers, EquiStamp has no reason to exist."

What information would most change my view: The full Manifund page (including comments and EquiStamp's detailed financial breakdown) and the full podcast transcripts. Also: any evidence about the quality of EquiStamp's work relative to alternatives -- do clients view them as excellent, adequate, or simply available?

Connected to (8)

Apart Research: collaborator
FAR.AI: collaborator
AISTOF: collaborator · JueYan Zhang
METR: staff from · Amy Ngo
Redwood Research: collaborator
UK AI Security Institute: collaborator · Daniel O'Connell
METR: collaborator · Chris Canal
StampyAI: spun off from · Rob Miles
Sources (37)
Every URL that was read during research.
  1. Research Operations for AI Safety (equistamp.com)
  2. Research Operations for AI Safety (equistamp.com)
  3. Understanding AI World Models w/ Chris Canal | Kairos.fm (kairos.fm)
  4. [Page title unavailable] (x.com)
  5. [Page title unavailable] (x.com)
  6. Research Operations for AI Safety (equistamp.com)
  7. About (equistamp.com)
  8. Models (equistamp.com)
  9. Evaluation Sessions (equistamp.com)
  10. HCAST: Human-Calibrated Autonomy Software Tasks (arxiv.org)
  11. HCAST: Human-Calibrated Autonomy Software Tasks (arxiv.org)
  12. EquiStamp (github.com)
  13. Episode #425: Agents, Evals, and the Future of AI: A Pragmatic Take with Christopher Canal (podcasts.apple.com)
  14. Better Systems | Christopher Canal | Substack (chriscanal.substack.com)
  15. The Omega Protocol : Another Manhattan Project (chriscanal.substack.com)
  16. ControlArena (control-arena.aisi.org.uk)
  17. A Plan to Fund Independent Assessments of General-Purpose AI (techpolicy.press)
  18. Strengthening AI Accountability Through Better Third Party Evaluations (hai.stanford.edu)
  19. GitHub - StampyAI/stampy: A Discord bot for the Robert Miles AI server (github.com)
  20. GitHub - EquiStamp/CFAR-Claude-Skills: A repository of claude skills that provide practice for CFAR techniques (github.com)
  21. Learning from History: Preventing AGI Existential Risks through Policy (alltrades.substack.com)
  22. Equistamp Management Team | Org Chart (rocketreach.co)
  23. Nobody Knows How to Safety-Test AI (time.com)
  24. AI companies aren't really using external evaluators (ailabwatch.org)
  25. AI Safety in Austin — Happy Hour · Luma (luma.com)
  26. Equistamp | Substack (equistamp.substack.com)
  27. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts (arxiv.org)
  28. Are We There Yet? Evaluating METR’s Eval of AI’s Ability to Complete Tasks of Different Lengths (empiricrafting.substack.com)
  29. Check a website at Scamadviser.com for a trust score (scamadviser.com)
  30. Equistamp 0.0.1 (equistamp.com)
  31. FAR.AI Selected to Lead EU AI Act CBRN Risk Consortium (far.ai)
  32. EU AI Office Launches €9 million tender for Technical Support on GPAI Safety (digital-strategy.ec.europa.eu)
  33. evaluation-design | Skills Marketplace (lobehub.com)
  34. Low effort podcast #3: How I really feel about AI evals (hollyelmore.substack.com)
  35. The AI Safety Debate Needs AI Skeptics (techpolicy.press)
  36. Evaluations (equistamp.com)
  37. DSL (equistamp.com)