Alignment Research Center (ARC)

Conceptual Research

Paul Christiano. Evals + alignment theory.

Founded: 2021
HQ: Berkeley, CA
Team: 11
Structure: 501(c)(3) nonprofit
Model: Grants

Theory of Change

ARC's theory of change is: if you can build mechanistic estimators that deduce properties of neural networks from their structure (rather than just sampling their outputs), you can estimate the probability of catastrophic behavior with far greater precision than black-box testing, enabling reliable adversarial training against rare but dangerous failures.

In ARC's own words: "We're currently working on outperforming random sampling when it comes to understanding neural network outputs. More broadly, we are trying to produce formal mechanistic explanations for neural network behaviors in order to produce robustly aligned systems." (About page)

The specific causal chain:

  1. Develop the theory of "heuristic explanations" -- formal but sub-proof-level deductive arguments about neural net behavior
  2. Build estimators that use these explanations to estimate rare-event probabilities (e.g., catastrophe probability of 10^-18) far more efficiently than sampling
  3. Use these estimators for adversarial training: efficiently search for inputs that trigger dangerous behavior, then train it away
  4. Additionally, detect anomalous reasoning patterns (mechanistic anomaly detection): if a model's explanation for behaving safely on training data breaks down on a new input, flag it
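
As a toy illustration of step 2 (not ARC's estimator; the function names and the single-halfspace setup are assumptions made for this sketch), consider a linear unit w·x + b with standard Gaussian inputs. Its output distribution can be deduced from the weights alone, so the probability of crossing a threshold follows from the Gaussian tail formula, while naive sampling at any realistic budget never observes the event.

```python
import math
import random


def deduced_tail_probability(w, b, threshold):
    """P(w.x + b > threshold) for x ~ N(0, I), computed from the weights.

    For Gaussian inputs, w.x + b is itself Gaussian with mean b and
    standard deviation ||w||, so the tail has a closed form.
    """
    sigma = math.sqrt(sum(wi * wi for wi in w))
    z = (threshold - b) / sigma
    # Upper Gaussian tail Q(z) = 0.5 * erfc(z / sqrt(2)); erfc stays
    # accurate for large z where 1 - cdf(z) would round to zero.
    return 0.5 * math.erfc(z / math.sqrt(2))


def sampled_tail_probability(w, b, threshold, n_samples, seed=0):
    """Black-box Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        output = b + sum(wi * rng.gauss(0.0, 1.0) for wi in w)
        if output > threshold:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    w, b, threshold = [0.6, 0.8], 0.0, 9.0  # hypothetical toy values
    print("deduced:", deduced_tail_probability(w, b, threshold))
    print("sampled:", sampled_tail_probability(w, b, threshold, 10_000))
```

With these toy values the deduced probability is on the order of 1e-19, while 10,000 samples return an estimate of exactly 0. Real networks do not admit a closed form like this one.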

This theory of change differs from standard mechanistic interpretability in a critical way: ARC does not aim for human-understandable explanations. Their explanations can be as large and incomprehensible as the network itself. The goal is algorithmic understanding, not human understanding.

Founder Paul Christiano assessed the probability of this research program fully succeeding at "10-20 percent" (Dwarkesh Podcast, Oct 2023). He called the project "crazily ambitious" with "a reasonable chance of being impossible."

What They Do

ARC is a ~11-person theoretical research lab in Berkeley that publishes papers and blog posts about the mathematical foundations of mechanistic interpretability. Their work is unusually abstract for an alignment org.

Current research (2025-2026): The "matching sampling principle" (MSP) -- building algorithms that mechanistically estimate neural net outputs at least as well as random sampling. Concrete progress:

  • Solved the MSP for random intersections of halfspaces
  • Believe they have a solution for random MLPs (proof sketch in progress)
  • Substantial progress on two-layer trained MLPs

Landmark earlier work:

  • Eliciting Latent Knowledge report (Dec 2021) -- the foundational document
  • ELK Prize (2022) -- 197 proposals, $274K in prizes, major community engagement
  • Compact Proofs of Model Performance (NeurIPS 2024) -- supported by ARC, bridging to formal verification
  • First empirical paper: Low Probability Estimation in Language Models (2024)

Non-research impact (historical):

  • ARC Evals evaluated GPT-4 (including the famous TaskRabbit CAPTCHA incident) and Claude before spinning off as METR in Dec 2023
  • Proactively returned $1.25M FTX Foundation grant
  • Christiano appointed Head of AI Safety at US AISI (Apr 2024)

Current activities beyond research: Running a MATS mentorship stream (Summer 2026, 5 mentors), year-round visiting researcher program, collaboration with Scott Aaronson's group at UT Austin.

Key People

Jacob Hilton (President/Executive Director): PhD in set theory (Leeds), former Jane Street trader, former OpenAI researcher (InstructGPT, WebGPT, TruthfulQA, scaling laws). Joined ARC's theory team early, became president after Christiano's departure. Compensation: $328K (2024).

Paul Christiano (Founder, departed Apr 2024): MIT math, Berkeley PhD, co-inventor of RLHF. Left to become Head of AI Safety at US AISI. TIME 100 AI 2023. 50% p(doom). Founded ARC in 2021 to return to theoretical research after 4 years at OpenAI. Was on Anthropic's Long-Term Benefit Trust. His departure removed ARC's most famous researcher and primary connection to labs and government.

Team: ~5 FT researchers (Hilton, Eric Neyman, Victor Lecomte, George Robinson, Wilson Wu), 3 collaborators (Mike Winer/IAS, Andrea Lincoln, Scott Aaronson/UT Austin), 3 ops staff. Board: Hilton, Buck Shlegeris (Redwood Research), Ben Hoskin. Notably, Holden Karnofsky (former Open Phil co-CEO) was board secretary/treasurer in 2021-2022 and subsequently left.

Departures: Mark Xu (key researcher through 2023, now absent from the team page and the 2024 990) and David Matolcsi (on the 2024 990 at $200K, reportedly wrote about becoming skeptical of ARC's approach). Both departures are unexplained.

Money and Incentives

Revenue and financial trajectory:

Year    Contributions    Expenses    Net Assets
2024    $5.2M            ~$9.1M      $3.3M
2023    $15.0M*          $6.3M       $7.1M
2022    $4.9M            $1.4M       $3.4M
2021    $476K            $153K       $322K

*2023 contributions ($15M) exceed revenue ($10M) due to the FTX grant return and possibly METR spinoff adjustments.

2024 is the key number. Post-METR spinoff and post-Christiano departure, ARC received $5.2M in contributions but spent ~$9.1M. Net assets fell from $7.1M to $3.3M. ARC is spending down its reserves.
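
The spend-down can be made concrete with a back-of-the-envelope calculation. The sketch below uses only the 2024 figures quoted above (in $M) and assumes, purely for illustration, that contributions and expenses stay at 2024 levels; the helper names are hypothetical.

```python
def net_burn(contributions, expenses):
    """Annual shortfall: how much more is spent than is received."""
    return expenses - contributions


def runway_months(net_assets, annual_burn):
    """Months until net assets reach zero at a constant annual burn."""
    if annual_burn <= 0:
        return float("inf")  # not spending down reserves
    return 12 * net_assets / annual_burn


# 2024 figures from the brief, in millions of dollars.
CONTRIBUTIONS_2024 = 5.2
EXPENSES_2024 = 9.1      # approximate ("~$9.1M" in the brief)
NET_ASSETS_2024 = 3.3

if __name__ == "__main__":
    burn = net_burn(CONTRIBUTIONS_2024, EXPENSES_2024)
    print(f"net burn: ${burn:.1f}M per year")
    print(f"runway: {runway_months(NET_ASSETS_2024, burn):.0f} months")
```

Under that assumption the net burn is about $3.9M per year and the $3.3M in net assets covers roughly 10 months. If 2024 expenses included one-time costs, the actual runway would be longer.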

Funding sources: $1.5M from Coefficient Giving / Open Philanthropy (2022), $1.4M from Future of Life Institute. The Survival and Flourishing Fund (Jaan Tallinn) is the likely dominant funder, but specific amounts are not public. The two named grants total $2.9M, so most of the roughly $25M in historical contributions comes from other sources.

Business model: Pure nonprofit grant-funded. Zero program revenue, zero product revenue, zero compute credits. No commercial incentives. No fundraising expenses reported.

Independence: Very high. No lab dependency. No government contracts. No product to ship. FTX grant return demonstrates willingness to sacrifice funding for principles. The main structural risk is likely donor concentration around SFF/Tallinn, though this cannot be confirmed without Schedule B data.

Compensation: Top compensation $435K (2024). Total salary+wages $2.8M (2024). Researcher pay in the $200-330K range. Modest for Berkeley, especially compared to industry alternatives these researchers could command.

What Others Say

Leopold Aschenbrenner (Mar 2023): "Paul is the single most respected alignment researcher in most circles... But his research now ('heuristic arguments') is roughly 'trying to solve alignment via galaxy-brained math proofs.' As much as I respect and appreciate Paul, I'm really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks and intuitions, rather than sophisticated theory." He praised ARC Evals but was skeptical of the theory work. (forourposterity.com)

NIST staff (Mar 2024): Ahead of Christiano's appointment to AISI, NIST staff reportedly "threatened to resign," fearing his effective altruism ties "could compromise the institute's objectivity and integrity." The appointment proceeded in April 2024. This illustrates how ARC's founder is perceived outside the safety community: as a brilliant but polarizing figure with "doomer" views. (Slashdot, VentureBeat)

Scott Aaronson (Apr 2025): Works closely with ARC and received an Open Philanthropy grant for TCS-for-alignment. Called ARC's research good and encouraged applications. This is a significant endorsement from a mainstream TCS luminary. (scottaaronson.blog)

ARC's own self-assessment: Christiano publicly stated only 10-20% chance of full success. ARC's about page acknowledges their approach "may completely miss strategies that exploit important structure in realistic ML models." They have published an index of obstacles to their research agenda. This transparency is unusual among alignment orgs.

What's Absent

No MSP results on real models. The MSP has been solved for random intersections of halfspaces, and ARC believes it has a solution for random MLPs, but these are toy settings. There is no evidence it scales to production transformers. This is the fundamental open question.

No lab partnerships post-METR. ARC's connections to labs were through Christiano and ARC Evals, both now gone. It is unclear how their theory would reach deployment without lab adoption.

Funding trajectory is opaque and possibly concerning. 2024 saw revenue halve and reserves drop by $3.8M. No public commentary on fundraising plans or runway.

Former researcher skepticism. David Matolcsi reportedly wrote about becoming skeptical of ARC's approach, but the content (on LessWrong) was not accessible. Mark Xu's departure is also unexplained.

Post-Christiano vision. There is very little public commentary from Hilton on how ARC's direction has evolved (or not) since the founder's departure.

Recommended Reading

  1. Paul Christiano on Dwarkesh Podcast (Oct 2023) -- The most candid source on ARC's intellectual foundations. Christiano explains the research, gives 10-20% success odds, discusses RLHF's limitations, and offers the most detailed public description of how heuristic arguments work. Extraordinarily frank across its 3-hour runtime. https://www.dwarkesh.com/p/paul-christiano

  2. "Nobody's on the ball on AGI alignment" by Leopold Aschenbrenner (Mar 2023) -- The strongest public critique of ARC's approach. Calls it "galaxy-brained math proofs" disconnected from empirical ML. Essential counterpoint to ARC's self-presentation. https://www.forourposterity.com/nobodys-on-the-ball-on-agi-alignment/

  3. "Competing with sampling" by Eric Neyman (Nov 2025) -- ARC's current research agenda explained accessibly. Best source for understanding what the matching sampling principle is and why ARC believes it matters. https://www.alignment.org/blog/competing-with-sampling/

  4. "My Trip to the Alignment Research Center" (Dec 2025) -- A visitor's first-hand account of ARC's culture and work. Provides an excellent lay explanation of the matching sampling principle through the bridge-builder analogy. https://eregis.github.io/blog/2025/12/15/arc-visit.html

Claude's Analysis
An opinionated read. Read the brief first to form your own view.

Stated Theory of Change

ARC claims that the path to safe superintelligent AI requires mechanistic understanding of neural networks -- not at the human-readable level (like standard mechanistic interpretability), but at the level of formal, algorithmic understanding that can be verified computationally.

The specific mechanism: Build estimators that can deduce properties of neural net outputs from their internal structure, achieving accuracy at least as good as random sampling but via deductive rather than inductive reasoning. This enables:

  1. Estimating catastrophe probability far below what sampling allows (e.g., 10^-18 instead of 10^-6)
  2. Adversarial training that efficiently finds and removes dangerous behaviors
  3. Anomaly detection that flags when a model's reason for behaving safely breaks down on new inputs
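
To see why sampling alone cannot reach probabilities like 10^-18, note that n independent samples all miss an event of probability p with probability (1 - p)^n. The sketch below (helper name hypothetical) solves for the n needed to observe at least one occurrence with a given confidence.

```python
import math


def samples_needed(p, confidence=0.95):
    """Smallest n with 1 - (1 - p)**n >= confidence, for independent samples."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p)**n <= 1 - confidence for n. log1p(-p) keeps precision
    # when p is tiny, where computing 1 - p directly would round to 1.0.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


if __name__ == "__main__":
    for p in (1e-6, 1e-18):
        print(f"p = {p:g}: about {samples_needed(p):.2e} samples")
```

At 95% confidence this comes to about 3 million samples for p = 10^-6 and about 3 x 10^18 for p = 10^-18, roughly 3/p in both cases.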

ARC positions this as complementary to (not competitive with) standard mechanistic interpretability. Their approach handles the "full explanation" needed for formal guarantees; standard interp handles the "partial, human understanding" needed for trust and debugging.

The theory of change implicitly assumes:

  • Alignment will require guarantees beyond what sampling can provide
  • Neural nets have enough structure that mechanistic estimation can beat sampling in practice (not just match it in theory)
  • The mathematical foundations can be worked out before the technology is needed
  • Labs will adopt the resulting techniques

Revealed Theory of Change

ARC's actions are remarkably consistent with their stated theory. This is one of the most intellectually honest orgs in the space.

What they actually do: Theoretical math and computer science research. The team spends its time at whiteboards working on combinatorics problems, formalizing notions of heuristic explanations, and proving things about estimators on toy architectures. This matches the stated agenda with unusual precision.

What they publish: A steady stream of blog posts and papers that build toward the MSP vision. Each publication fits into the "bird's eye view" diagram. There is no evidence of mission drift, commercial side-projects, or capability research.

What they hire for: Theorists -- math PhDs, TCS researchers, theoretical physicists. The hiring page explicitly seeks people with "backgrounds in computer science, math, and theoretical physics." They do not hire ML engineers, product designers, or policy staff.

What they spend money on: Salaries for researchers. No fundraising costs, no PR, no flashy compute bills.

Where stated and revealed diverge:

  • The stated theory mentions "collaborations with industry labs" as a future plan. In practice, post-METR spinoff, ARC has zero visible lab partnerships. The theory-to-practice pipeline is entirely absent.
  • The about page mentions potentially branching into "empirical alignment research, alignment forecasting, and ML deployment policy." None of these have materialized. ARC is more narrowly theoretical than its own aspirations suggest.
  • Christiano spoke repeatedly about the importance of "external pressure on labs." But ARC itself no longer exerts such pressure -- that function left with ARC Evals/METR and Christiano himself.

Key Assumptions

1. Neural networks have sufficient structure for mechanistic estimation to beat sampling in practice.

  • Evidence for: The MSP has been solved for random intersections of halfspaces, ARC believes it has a solution for random MLPs (proof sketch in progress), and there is substantial progress on trained two-layer MLPs.
  • Evidence against: No MSP results on transformers, attention mechanisms, or anything resembling a production model. The gap between "random MLP" and "GPT-5" is enormous.
  • Testable? Yes, incrementally. Solving MSP for random transformers, then trained transformers, would be strong signals.
  • If wrong: ARC's entire research program fails to produce practical tools.

2. Formal heuristic arguments are mathematically possible and tractable.

  • Evidence for: Christiano's 10-20% estimate implies he thinks it's possible but hard. Progress on the MSP in 2025 suggests tractability on simple cases.
  • Evidence against: No formal definition of "mechanistic" exists yet. The foundational formalism is incomplete.
  • If wrong: This is the existential risk for ARC as an organization. But even partial results could be valuable.

3. There is enough time to develop this theory before it's needed.

  • Evidence for: Christiano's own timelines (40% by 2040 for AGI) give potentially a decade.
  • Evidence against: Many researchers have much shorter timelines. If human-level AI arrives by 2028-2030, ARC's theory is almost certainly not ready.
  • If wrong: ARC's work becomes foundational research for a future that arrives too late to benefit from it. The work could still be valuable if a "slow takeoff" allows iterative adoption.

4. Labs will adopt the techniques if developed.

  • Evidence for: ARC Evals' work was adopted by OpenAI and Anthropic. Christiano has personal relationships with lab leaders.
  • Evidence against: Post-Christiano, post-METR, ARC has no visible lab relationships. Theory adoption is historically very slow in ML. Labs optimize for capabilities, not formal safety.
  • If wrong: ARC produces beautiful mathematics that collects dust.

Strengths

Intellectual seriousness. ARC is among the most intellectually honest orgs in AI safety. Christiano publicly assigns 10-20% success probability. The about page acknowledges their approach might miss important structure. They published obstacles to their own agenda. This is rare.

Founder quality. Christiano is one of the most technically credible people in AI safety -- co-inventor of RLHF, TIME 100 AI, Head of AI Safety at US AISI. Even critics like Aschenbrenner call him "the single most respected alignment researcher in most circles." This credibility gives ARC's research more weight than its small size would otherwise warrant.

Unique niche. No one else is doing exactly this work. Standard mechanistic interpretability aims for human understanding; ARC aims for algorithmic understanding. Formal verification aims for proofs; ARC aims for heuristic arguments. This specific middle ground (formal but not proven, mechanistic but not human-readable) is ARC's distinctive contribution.

Recent momentum. The 2025 MSP reorientation appears to have reinvigorated the program. The visitor account, MATS stream, and Aaronson collaboration all suggest an org that is intellectually alive and attracting talent despite its small size.

Clean incentives. Pure nonprofit, no commercial pressure, no lab dependency, modest salaries. The FTX grant return demonstrated ethical backbone. This is an org that can follow the truth wherever it leads.

Weaknesses and Risks

The empirical gap is vast. Random MLPs are to GPT-5 as toy boats are to aircraft carriers. The jump from "mechanistic estimation matches sampling on random two-layer networks" to "mechanistic estimation beats sampling on trained transformers at scale" is not incremental. It could easily turn out to be impossible.

Christiano's departure left a hole. He was the intellectual founder, the public face, the lab connector, and the government interlocutor. Hilton is capable but lower-profile. ARC's external influence has contracted significantly.

Financial trajectory is alarming. Revenue halved from 2023 to 2024. Reserves dropped from $7.1M to $3.3M. If this trend continues, ARC may face a funding crisis within 1-2 years. This is especially concerning because the research is explicitly long-term and high-risk -- exactly the kind of work that needs patient capital.

No theory-to-practice pipeline. Even if the theory works, there is no mechanism for getting it into the hands of labs. ARC has no partnerships, no deployment team, no policy staff. This is the most significant structural weakness in the theory of change.

Community criticism is valid. Aschenbrenner's critique that "basically all deep learning progress has been empirical" is historically accurate. The few cases where theory guided ML practice (PAC learning, VC dimension) are far narrower in scope than what ARC attempts. ARC is betting against the base rate.

Cross-References

METR (former ARC Evals): ARC's practical-impact spin-off. METR does what ARC used to do -- evaluate frontier models for dangerous capabilities. The two orgs are complementary in theory but no longer organizationally connected. Christiano declined a METR board role.

Redwood Research: Co-located at Constellation, Buck Shlegeris on ARC's board. Redwood does more empirical alignment work. The relationship is collegial but the research agendas are distinct.

Anthropic: Christiano was on Anthropic's Long-Term Benefit Trust. Anthropic's interpretability team (Chris Olah et al.) does the empirical version of what ARC does theoretically. ARC's approach could be seen as the "formalization" complement to Anthropic's "visualization" approach.

MIRI: Both MIRI and ARC do theoretical alignment work, but with very different aesthetics. MIRI focuses on decision theory and agent foundations; ARC focuses on formalizing interpretability. Aschenbrenner groups them together as "disconnected from actual ML" but ARC's recent empirical work on low probability estimation in transformers is a counter-signal.

What Would Change This Assessment

Upward:

  • ARC demonstrates MSP matching sampling on a trained transformer (even a small one). This would be a watershed moment.
  • A major funder commits multi-year patient capital, securing runway through the research program's completion.
  • A lab partnership emerges where ARC's theory is applied to a real safety case.

Downward:

  • ARC fails to extend MSP results beyond random/simple architectures after another 1-2 years of focused effort.
  • Further key researcher departures (especially Hilton or Neyman).
  • Financial runway drops below 12 months without new funding secured.
  • Another credible critic (not just Aschenbrenner) publishes a detailed technical argument that the MSP is fundamentally limited.

Self-Critique

What sources should I have checked but didn't?

  • David Matolcsi's LessWrong post about becoming skeptical of ARC's approach. This is potentially the most revealing critic source and was on a blocked domain.
  • Christiano's "My views on doom" LessWrong post -- key for understanding the worldview that founded ARC.
  • ARC's own "Obstacles" blog post content (full content on LessWrong, only index fetched).
  • Any private or semi-private communications about the 2024 financial situation.

Where is this analysis potentially biased?

  • I may be giving too much weight to Christiano's self-assessed 10-20% success probability. Founders are sometimes strategically humble. But in this case, his track record of candor (e.g., publicly stating 50% p(doom)) suggests this is genuine.
  • I may be overstating the financial concern. The 2024 990 could reflect one-time costs (METR spinoff transition, Christiano departure) rather than a structural funding decline.
  • I may be understating the value of theoretical work. The history of science is full of theoretical foundations that seemed disconnected until they suddenly weren't.

What would a thoughtful person who disagrees say? "ARC is doing exactly what needs to be done -- building the mathematical foundations that will be needed when current empirical approaches hit their limits. The criticism that 'all DL progress is empirical' is backward-looking. Formal safety guarantees are the future, and someone needs to start building the theory now. The fact that it hasn't scaled to transformers yet is expected for 2026 -- check back in 2030."

What's my single weakest claim? That the financial trajectory is "alarming." The 2024 numbers could be entirely explained by one-time METR spinoff costs, and ARC may have secured multi-year funding that simply hasn't been publicly announced.

What information would most change my view? Evidence that the MSP works (or provably cannot work) on trained transformers. This single result would either validate or invalidate the entire research program.

Sources (48)
Every URL that was read during research.
  1. Alignment Research Center - Wikipedia (en.wikipedia.org)
  2. Alignment Research Center (alignment.org)
  3. Team (alignment.org)
  4. Alignment Research Center (alignment.org)
  5. Alignment Research Center (Page 1) (alignment.org)
  6. A bird's eye view of ARC's research (alignment.org)
  7. Formal verification, heuristic explanations and surprise accounting (alignment.org)
  8. Obstacles in ARC's research agenda (alignment.org)
  9. ELK prize results (alignment.org)
  10. Funding from FTX (alignment.org)
  11. ARC's first technical report: Eliciting Latent Knowledge (alignment.org)
  12. Alignment Research Center (alignment.org)
  13. Hiring (alignment.org)
  14. Paul Christiano - Preventing AI Takeover (dwarkesh.com)
  15. 12 - AI Existential Risk with Paul Christiano (axrp.net)
  16. How AI will dominate the 21st century & what to do about it (80000hours.org)
  17. Paul Christiano - Wikipedia (en.wikipedia.org)
  18. Paul Christiano (nist.gov)
  19. ARC Evals is spinning out from ARC (metr.org)
  20. TIME100 AI 2023: Paul Christiano (time.com)
  21. Alignment Research Center (ARC) at MATS: Summer 2026 (matsprogram.org)
  22. My Trip to the Alignment Research Center (eregis.github.io)
  23. AI alignment (paulfchristiano.com)
  24. Alignment Research Center - Nonprofit Explorer - ProPublica (projects.propublica.org)
  25. Paul Christiano (paulfchristiano.com)
  26. Jacob Hilton's Homepage (jacobh.co.uk)
  27. Can we efficiently explain model behaviors? (alignment.org)
  28. ARC Evals is now METR (metr.org)
  29. Nobody’s on the ball on AGI alignment (forourposterity.com)
  30. Theoretical Computer Science for AI Alignment … and More (scottaaronson.blog)
  31. ARC is hiring theoretical researchers (alignment.org)
  32. Competing with sampling (alignment.org)
  33. Research update: Towards a Law of Iterated Expectations for Heuristic Estimators (alignment.org)
  34. U.S. Commerce Secretary Gina Raimondo Announces Expansion of U.S. AI Safety Institute Leadership Team (nist.gov)
  35. Our Giving (goodventures.org)
  36. How is the Alignment Research Center (ARC) trying to solve Eliciting Latent Knowledge (ELK)? (aisafety.info)
  37. Feds Appoint 'AI Doomer' To Run US AI Safety Institute - Slashdot (news.slashdot.org)
  38. METR - Wikipedia (en.wikipedia.org)
  39. Alignment Research Center | Covina, CA | 990 Report | Instrumentl (instrumentl.com)
  40. Compact Proofs of Model Performance via Mechanistic Interpretability (arxiv.org)
  41. Is it possible to leave a helpful message for a future civilization, just in case humanity dies out? (80000hours.org)
  42. SFF-2024 S-Process Recommendations Announcement | Survival and Flourishing Fund (survivalandflourishing.fund)
  43. Alignment Research Center, Full Filing - Nonprofit Explorer - ProPublica (projects.propublica.org)
  44. Charity Navigator - Rating for Alignment Research Center (charitynavigator.org)
  45. Low Probability Estimation in Language Models (alignment.org)
  46. Backdoors as an analogy for deceptive alignment (alignment.org)
  47. Estimating Tail Risk in Neural Networks (alignment.org)
  48. Mechanistic anomaly detection and ELK (alignment.org)