Theory of Change
EquiStamp's theory of change is that AI safety research organizations need reliable operational labor to build, run, and maintain evaluations -- and that this labor must be independent of the labs being evaluated. From their website: "We're a labor pool for alignment research. Labs and projects come to us to implement evals, run baselines, annotate data, and handle administrative overhead. The idea is simple: researchers focus on research, and we handle the grunt work."
CEO Chris Canal frames this around a principal-agent problem: AI companies cannot be trusted to grade their own safety because they have financial incentives to pass. On this view, third-party evaluation infrastructure, run by people motivated by safety rather than profit, is a structural necessity. EquiStamp positions itself as that infrastructure -- not the researchers designing the evals, but the engineers implementing them.
The causal chain: EquiStamp provides skilled, flexible labor to evaluation organizations (primarily METR) -> those organizations produce higher-quality, more rigorous evaluations -> those evaluations are used in pre-deployment safety testing of frontier AI models -> this creates accountability and slows unsafe deployment.
What They Do
EquiStamp is a small (~8-person), young (~2.5-year-old) contract services company that builds and operates AI safety evaluations for research organizations. Their concrete output:
METR work (primary client): Co-authored the HCAST benchmark (189 tasks, 563 human baselines, used for pre-deployment evaluation of GPT-4.5, Claude 3.5 Sonnet, DeepSeek V3). Contributed human baselines to RE-Bench (one of the task suites, alongside HCAST, behind METR's influential "AI capability doubling every 7 months" result). Canal and O'Connell are credited authors on the HCAST paper; four other team members are acknowledged for baseline work.
UK AISI work: Built ControlArena, an open-source library for AI control experiments, in collaboration with the UK AI Security Institute and Redwood Research. EquiStamp team members are named in the library's citation.
EU AI Act work: Subcontractor for "evaluation engineering" in the FAR.AI-led consortium selected for CBRN risk evaluation (Lot 1) and manipulation risk evaluation (Lot 4) under the EU AI Act.
Other clients: Built CVE Bench for Daniel Kang's lab (UIUC). Built "Linux Bench" for Redwood Research (no public documentation found).
Evaluation platform: Operates equistamp.com with API, model scoring, custom evaluations, and anti-gaming features (paraphrased evaluations). No publicly visible adoption. Published a sophisticated evaluation design methodology as a Claude Skill with a 7-dimension rubric.
The labor model is task-based: a core group of regulars plus a broader pool of specialists pick up tasks as available, on flexible schedules. This is essentially a managed contractor pool, not a traditional employer.
Key People
Chris Canal (CEO, Co-Founder): BS Computer Engineering, Northeastern. Former SDE at Amazon Last Mile; founded Wizio (real estate VR, acquired 2019). Based in Austin, TX. Most public-facing -- two podcast appearances, policy writing, Substack posts. His pre-EquiStamp writing reveals genuine engagement with AI safety (the "Omega Protocol" proposed a $5.7B Manhattan Project for alignment; a co-authored policy article analyzed nuclear and CFC policy as models for AI governance). Background is applied engineering, not academic safety research.
Daniel O'Connell (CTO, Co-Founder): UK citizen. Co-author on HCAST, contributor on ControlArena. Minimal public presence. His non-US citizenship blocks the company from SBA loans.
Rob Miles (AI Advisor): PhD, prominent AI safety YouTuber, creator of StampyAI/aisafety.info. EquiStamp grew out of his Discord server. The exact nature of his advisory role (equity, board seat, or informal) is not publicly documented.
Team of ~6 additional contributors, several with prior METR experience. Distributed globally (US, UK, Canada, New Zealand).
Money and Incentives
Revenue: $192K accumulated operating capital from client revenue (self-reported, as of Manifund filing). No independent verification exists.
Business model: Pure contract revenue from AI safety organizations. Clients include METR (primary), UK AISI, Redwood Research, Daniel Kang's lab (UIUC), EU AI Office (via FAR.AI consortium). No visible product revenue from the evaluation platform.
External funding: Sought $400K via Manifund through a SAFE agreement with an $8M valuation cap, channeled through AISTOF (AI Safety Tactical Opportunities Fund, managed by JueYan Zhang, $30M+ deployed). This is an investment (with equity conversion rights), not a grant. No grants from Coefficient Giving or any other major AI safety funder appear in public records.
Cash flow constraint: Engineers paid weekly; clients pay on Net 30-180 day terms. EU Commission pays every 6 months. This mismatch is the company's primary financial challenge and the reason for the Manifund fundraise. SBA loans were denied because the CTO is a UK citizen.
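The size of this mismatch can be expressed without knowing any dollar amounts. The sketch below is a simplification, not EquiStamp's own accounting: it assumes work is invoiced as it is performed and that clients pay exactly at the end of the net term. The function name and its parameters are illustrative only.

```python
def weeks_of_payroll_fronted(net_days: int, pay_period_days: int = 7) -> float:
    """Pay periods the company must cover before the client's payment lands.

    Assumption (not from the source): the invoice clock starts when the work
    is done and the client pays exactly on the last day of the net term.
    """
    if net_days < 0 or pay_period_days <= 0:
        raise ValueError("net_days must be >= 0 and pay_period_days must be > 0")
    return net_days / pay_period_days


# The two ends of the Net 30-180 range quoted above, with weekly payroll.
for net_days in (30, 180):
    weeks = weeks_of_payroll_fronted(net_days)
    print(f"Net {net_days}: about {weeks:.1f} weeks of payroll paid before cash arrives")
```

Under these assumptions, the far end of the range means roughly half a year of payroll goes out before the matching revenue comes in, which is the gap the fundraise is meant to bridge.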
Client concentration: METR appears to be the dominant client. Total client count is approximately 5 organizations. Loss of the METR relationship would likely be existential.
Incentive analysis: EquiStamp's financial dependence on a small number of safety organizations recreates a version of the structural conflict it claims to solve. Canal argues that third-party evals counter the principal-agent problem of labs grading their own safety, but EquiStamp is itself paid by the organizations whose evals it implements. If EquiStamp discovered fundamental flaws in METR's methodology, it would face financial pressure not to publicize them. The public benefit corporation (PBC) structure provides legal cover to prioritize safety over profit, but enforcement is self-policed.
Scale context: $192K total capital for ~2.5 years of operation, with ~8 contributors, suggests either very small contract values, very part-time contributors, or both. For comparison, METR had $10M+ revenue in 2023.
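The arithmetic behind this reading is simple enough to check directly. The sketch below uses only the approximate, self-reported figures quoted above and spreads the $192K evenly across time and contributors. That even split is an assumption made for illustration. Because the figure is accumulated capital rather than total revenue, the results are averages of what was retained, not estimates of anyone's pay.

```python
# Approximate, self-reported inputs taken from the text above.
ACCUMULATED_CAPITAL_USD = 192_000
YEARS_OPERATING = 2.5
CONTRIBUTORS = 8


def per_year(total_usd: float, years: float) -> float:
    """Average amount per year of operation."""
    return total_usd / years


def per_contributor_year(total_usd: float, years: float, contributors: int) -> float:
    """Average amount per contributor per year, assuming an even split."""
    return per_year(total_usd, years) / contributors


annual = per_year(ACCUMULATED_CAPITAL_USD, YEARS_OPERATING)
share = per_contributor_year(ACCUMULATED_CAPITAL_USD, YEARS_OPERATING, CONTRIBUTORS)

print(f"Per year of operation: ${annual:,.0f}")
print(f"Per contributor per year: ${share:,.0f}")
```

With these inputs the averages come to about $76,800 per year and $9,600 per contributor per year, far below a full-time engineering salary, which is consistent with the small-contract or part-time reading.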
What Others Say
Structural validation (Stanford HAI, 2024): A workshop on third-party AI evaluations identified exactly the problems EquiStamp tries to solve -- financial dependence of evaluators, lack of legal protections, no standardization. "Companies write their own tests and they grade themselves."
Evals ecosystem critique (AI Lab Watch, 2024): Labs have not meaningfully granted external evaluators pre-deployment access since GPT-4/Claude 2 in early 2023. If labs are not giving evaluators real access, the evaluation infrastructure EquiStamp builds may not serve as actual safety gates.
Evals readiness critique (TIME, 2024): METR's own policy director acknowledges "the evals are not ready" for the role they are expected to play. Connor Leahy of Conjecture warns that evaluation work could be complicit in "safetywashing."
METR methodology critique (Empiricrafting, 2025): Technical criticism of METR's flagship "doubling every 7 months" evaluation methodology. The benchmark consists almost entirely of software engineering tasks (not disclosed in the abstract), human baselines have a 61% success rate, which raises questions about their reliability, and the "messiness" of tasks is low compared to real work. Since EquiStamp's labor feeds into this methodology, the critique applies indirectly.
Funding structure critique (Manifund): Commenters criticized the SAFE arrangement as a "business loan disguised as grant," questioning whether a for-profit investment belongs on a charitable funding platform.
Defense (Beth Barnes, METR founder): Positions evaluation work as "a pragmatic step that improves things in the absence of a moratorium." Acknowledges evals are not the ideal solution but argues it is better for companies to publish flawed safety plans than none at all.
What's Absent
- Zero presence on LessWrong, Alignment Forum, or EA Forum. The org is invisible to the AI safety community's primary discussion platforms.
- No public criticism of EquiStamp's actual work quality. No independent assessment of whether their evaluation engineering is good or bad.
- No public financials beyond self-reported $192K. No 990 filings (for-profit). No independent financial verification.
- No documented team departures or public testimonials about working at EquiStamp.
- No comparison with alternative evaluation contractors or analysis of EquiStamp's competitive position.
- Evaluation platform (equistamp.com) shows no visible external adoption or usage.
- Full podcast transcripts unavailable, leaving the most candid sources on Canal's thinking only partially captured.
Recommended Reading
Crazy Wisdom Podcast #425 (44 min, Jan 2025) -- Canal discusses evals, principal-agent dilemmas, and ethical boundaries. Most unfiltered view of how EquiStamp thinks. https://podcasts.apple.com/gb/podcast/episode-425-agents-evals-and-the-future-of-ai/id1354589767?i=1000683477445
Empiricrafting: "Are We There Yet?" (Dec 2025) -- Detailed technical critique of the METR evaluation methodology that EquiStamp helps build. The strongest available counterargument to EquiStamp's ecosystem. https://empiricrafting.substack.com/p/are-we-there-yet-evaluating-metrs
HCAST paper (arXiv 2503.17354) -- EquiStamp's primary work product. Read the author contributions (Appendix A) and acknowledgments (Appendix B) to understand their actual contribution. https://arxiv.org/abs/2503.17354
Stanford HAI: Third Party Evaluations (Nov 2024) -- Structural analysis of the evaluation ecosystem EquiStamp operates in, including funding independence challenges. https://hai.stanford.edu/news/strengthening-ai-accountability-through-better-third-party-evaluations
TIME: "Nobody Knows How to Safety-Test AI" (Mar 2024) -- Context on how even evaluation proponents acknowledge the methodology is immature. https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/