Theory of Change
ARC's theory of change is: if you can build mechanistic estimators that deduce properties of neural networks from their structure (rather than just sampling their outputs), you can estimate the probability of catastrophic behavior with far greater precision than black-box testing, enabling reliable adversarial training against rare but dangerous failures.
In ARC's own words: "We're currently working on outperforming random sampling when it comes to understanding neural network outputs. More broadly, we are trying to produce formal mechanistic explanations for neural network behaviors in order to produce robustly aligned systems." (About page)
The specific causal chain:
- Develop the theory of "heuristic explanations" -- formal but sub-proof-level deductive arguments about neural net behavior
- Build estimators that use these explanations to estimate rare-event probabilities (e.g., catastrophe probability of 10^-18) far more efficiently than sampling
- Use these estimators for adversarial training: efficiently search for inputs that trigger dangerous behavior, then train it away
- Additionally, detect anomalous reasoning patterns (mechanistic anomaly detection): if a model's explanation for behaving safely on training data breaks down on a new input, flag it
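To make the gap between sampling and structure-based estimation concrete, here is a minimal toy sketch. It is not ARC's method: it assumes a standard Gaussian variable, chosen only because its tail probability has a known closed form, and all function names are hypothetical. It contrasts a black-box sampling estimate of a rare event with an estimate deduced from the distribution's structure.

```python
import math
import random


def naive_sampling_estimate(threshold, n_samples, seed=0):
    """Black-box estimate of P(Z > threshold): draw samples and count hits."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n_samples


def structural_estimate(threshold):
    """Estimate deduced from known structure (toy assumption: Z is standard
    normal), using the exact tail formula P(Z > t) = 0.5 * erfc(t / sqrt(2))."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


def samples_needed(p, expected_hits=1.0):
    """Number of samples needed to expect `expected_hits` occurrences of an
    event with probability p."""
    return expected_hits / p


threshold = 8.8  # chosen so the tail probability is on the order of 10^-18
p_exact = structural_estimate(threshold)
p_sampled = naive_sampling_estimate(threshold, n_samples=100_000)

print(f"structural estimate: {p_exact:.2e}")
print(f"sampling estimate:   {p_sampled:.2e}")
print(f"samples needed to expect one hit: {samples_needed(p_exact):.2e}")
```

In this toy setting the sampling estimate is zero at any affordable sample count, while the structural estimate costs a single function evaluation. The open question for ARC's agenda is whether anything analogous can be done for a trained neural network, where no closed form is available.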
This theory of change differs from standard mechanistic interpretability in a critical way: ARC does not aim for human-understandable explanations. Their explanations can be as large and incomprehensible as the network itself. The goal is algorithmic understanding, not human understanding.
Founder Paul Christiano assessed the probability of this research program fully succeeding at "10-20 percent" (Dwarkesh Podcast, Oct 2023). He called the project "crazily ambitious" with "a reasonable chance of being impossible."
What They Do
ARC is a ~11-person theoretical research lab in Berkeley that publishes papers and blog posts about the mathematical foundations of mechanistic interpretability. Their work is unusually abstract for an alignment org.
Current research (2025-2026): The "matching sampling principle" (MSP) -- building algorithms that mechanistically estimate neural net outputs at least as well as random sampling. Concrete progress:
- Solved the MSP for random intersections of halfspaces
- Believe they have a solution for random MLPs (proof sketch in progress)
- Substantial progress on two-layer trained MLPs
Landmark earlier work:
- Eliciting Latent Knowledge report (Dec 2021) -- the foundational document
- ELK Prize (2022) -- 197 proposals, $274K in prizes, major community engagement
- Compact Proofs of Model Performance (NeurIPS 2024) -- supported by ARC, bridging to formal verification
- First empirical paper: Low Probability Estimation in Language Models (2024)
Non-research impact (historical):
- ARC Evals evaluated GPT-4 (including the famous TaskRabbit CAPTCHA incident) and Claude before spinning off as METR in Dec 2023
- Proactively returned $1.25M FTX Foundation grant
- Christiano appointed Head of AI Safety at US AISI (Apr 2024)
Current activities beyond research: Running a MATS mentorship stream (Summer 2026, 5 mentors), year-round visiting researcher program, collaboration with Scott Aaronson's group at UT Austin.
Key People
Jacob Hilton (President/Executive Director): PhD in set theory (Leeds), former Jane Street trader, former OpenAI researcher (InstructGPT, WebGPT, TruthfulQA, scaling laws). Joined ARC's theory team early, became president after Christiano's departure. Compensation: $328K (2024).
Paul Christiano (Founder, departed Apr 2024): MIT math, Berkeley PhD, co-inventor of RLHF. Left to become Head of AI Safety at US AISI. TIME 100 AI 2023. 50% p(doom). Founded ARC in 2021 to return to theoretical research after 4 years at OpenAI. Was on Anthropic's Long-Term Benefit Trust. His departure removed ARC's most famous researcher and primary connection to labs and government.
Team: ~5 FT researchers (Hilton, Eric Neyman, Victor Lecomte, George Robinson, Wilson Wu), 3 collaborators (Mike Winer/IAS, Andrea Lincoln, Scott Aaronson/UT Austin), 3 ops staff. Board: Hilton, Buck Shlegeris (Redwood Research), Ben Hoskin. Notably, Holden Karnofsky (former Open Phil co-CEO) was board secretary/treasurer in 2021-2022 and subsequently left.
Departures: Mark Xu (key researcher through 2023, now absent from the team page and the 2024 990) and David Matolcsi (on the 2024 990 at $200K, reportedly wrote about becoming skeptical of ARC's approach) have both left. Neither departure has been explained.
Money and Incentives
Revenue and financial trajectory:
| Year | Contributions | Expenses | Net Assets |
|---|---|---|---|
| 2024 | $5.2M | ~$9.1M | $3.3M |
| 2023 | $15.0M* | $6.3M | $7.1M |
| 2022 | $4.9M | $1.4M | $3.4M |
| 2021 | $476K | $153K | $322K |
*2023 contributions ($15M) exceed revenue ($10M) due to the FTX grant return and possibly METR spinoff adjustments.
The 2024 figures are the key numbers. Post-METR spinoff and post-Christiano departure, ARC received $5.2M in contributions but spent ~$9.1M. Net assets fell from $7.1M to $3.3M. ARC is spending down its reserves.
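The arithmetic behind that conclusion can be checked with a short sketch. The inputs are the 2024 and 2023 table figures; the runway number is only an illustration under the loud assumption that the 2024 deficit repeats unchanged and no new funding arrives. ARC has not published a runway figure, and the variable names are hypothetical.

```python
# Figures in millions of USD, taken from the table above.
contributions_2024 = 5.2
expenses_2024 = 9.1          # approximate ("~$9.1M")
net_assets_end_2023 = 7.1
net_assets_end_2024 = 3.3

# Operating deficit implied by contributions and expenses.
deficit_2024 = expenses_2024 - contributions_2024

# Drop in reserves as reported in the Net Assets column.
reported_drop = net_assets_end_2023 - net_assets_end_2024

# ASSUMPTION: the 2024 deficit repeats unchanged and no new funding arrives.
implied_runway_years = net_assets_end_2024 / deficit_2024

print(f"2024 deficit:         ${deficit_2024:.1f}M")
print(f"Reported reserve drop: ${reported_drop:.1f}M")
print(f"Implied runway:        {implied_runway_years:.2f} years")
```

The deficit and the reported drop in net assets agree to within rounding, which suggests the table is internally consistent for 2024. Under the stated assumption the remaining reserves would cover less than one further year of the 2024 deficit; actual runway depends on fundraising and spending decisions that are not public.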
Funding sources: $1.5M from Coefficient Giving / Open Philanthropy (2022), $1.4M from Future of Life Institute. The Survival and Flourishing Fund (Jaan Tallinn) is the likely dominant funder, but specific amounts are not public. The rest of the roughly $25M in historical contributions comes from other sources.
Business model: Pure nonprofit grant-funded. Zero program revenue, zero product revenue, zero compute credits. No commercial incentives. No fundraising expenses reported.
Independence: Very high. No lab dependency. No government contracts. No product to ship. The FTX grant return demonstrates a willingness to sacrifice funding for principles. The main structural risk is likely donor concentration around SFF/Tallinn, though this cannot be confirmed without Schedule B data.
Compensation: Top compensation $435K (2024). Total salary+wages $2.8M (2024). Researcher pay in the $200-330K range. Modest for Berkeley, especially compared to industry alternatives these researchers could command.
What Others Say
Leopold Aschenbrenner (Mar 2023): "Paul is the single most respected alignment researcher in most circles... But his research now ('heuristic arguments') is roughly 'trying to solve alignment via galaxy-brained math proofs.' As much as I respect and appreciate Paul, I'm really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks and intuitions, rather than sophisticated theory." He praised ARC Evals but was skeptical of the theory work. (forourposterity.com)
NIST staff (Mar 2024): Ahead of Christiano's appointment to AISI, NIST staff reportedly "threatened to resign," fearing his effective altruism ties "could compromise the institute's objectivity and integrity." The appointment proceeded. This illustrates how ARC's founder is perceived outside the safety community: as a brilliant but polarizing figure with "doomer" views. (Slashdot, VentureBeat)
Scott Aaronson (Apr 2025): Working closely with ARC, received Open Philanthropy grant for TCS-for-alignment. Called ARC's research good and encouraged applications. This is a significant endorsement from a mainstream TCS luminary. (scottaaronson.blog)
ARC's own self-assessment: Christiano publicly stated only 10-20% chance of full success. ARC's about page acknowledges their approach "may completely miss strategies that exploit important structure in realistic ML models." They have published an index of obstacles to their research agenda. This transparency is unusual among alignment orgs.
What's Absent
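No empirical results for the MSP on real models. The MSP has been solved for random intersections of halfspaces, with a believed solution for random MLPs and partial progress on two-layer trained MLPs; all of these are toy settings. There is no evidence it scales to production transformers. This is the fundamental open question.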
No lab partnerships post-METR. ARC's connections to labs were through Christiano and ARC Evals, both now gone. It is unclear how their theory would reach deployment without lab adoption.
Funding trajectory is opaque and possibly concerning. 2024 saw revenue halve and reserves drop by $3.8M. No public commentary on fundraising plans or runway.
Former researcher skepticism. David Matolcsi reportedly wrote about becoming skeptical of ARC's approach, but the content (on LessWrong) was not accessible. Mark Xu's departure is also unexplained.
Post-Christiano vision. There is very little public commentary from Hilton on how ARC's direction has evolved (or not) since the founder's departure.
Recommended Reading
Paul Christiano on Dwarkesh Podcast (Oct 2023) -- The most candid source on ARC's intellectual foundations. Christiano explains the research, gives 10-20% success odds, discusses RLHF's limitations, and offers the most detailed public description of how heuristic arguments work. Extraordinarily frank across its 3 hours. https://www.dwarkesh.com/p/paul-christiano
"Nobody's on the ball on AGI alignment" by Leopold Aschenbrenner (Mar 2023) -- The strongest public critique of ARC's approach. Calls it "galaxy-brained math proofs" disconnected from empirical ML. Essential counterpoint to ARC's self-presentation. https://www.forourposterity.com/nobodys-on-the-ball-on-agi-alignment/
"Competing with sampling" by Eric Neyman (Nov 2025) -- ARC's current research agenda explained accessibly. Best source for understanding what the matching sampling principle is and why ARC believes it matters. https://www.alignment.org/blog/competing-with-sampling/
"My Trip to the Alignment Research Center" (Dec 2025) -- A visitor's first-hand account of ARC's culture and work. Provides an excellent lay explanation of the matching sampling principle through the bridge-builder analogy. https://eregis.github.io/blog/2025/12/15/arc-visit.html