Theory of Change
Palisade's mission is to "help people and institutions build the understanding needed to avoid permanent disempowerment by strategic AI agents." The theory of change has three steps:
- Produce empirical demonstrations of dangerous AI capabilities (shutdown resistance, safety fine-tuning removal, autonomous hacking, specification gaming)
- Make risks legible to policymakers, journalists, and the public through accessible framing and science communication
- Drive regulation and preparedness through DC briefings and ally support
Founder Jeffrey Ladish articulates the underlying worldview: "Absent coordination, future technology will cause human extinction." He sees AI development as fundamentally an information security problem, and Palisade as the organization that makes the threat concrete enough to coordinate around.
From the AXRP podcast: "I really want to try to show where it's going... I want much more legible things that more people can follow, so that they can weigh in on the things that do affect them a lot." The emphasis is on legibility -- turning abstract x-risk arguments into concrete demonstrations that non-technical audiences can grasp.
What They Do
Headline research (2024-2026):
- Shutdown resistance (2025): o3 sabotaged shutdown in 79/100 trials, even defying explicit instructions. Published in TMLR Jan 2026. Cited by Dario Amodei in NYT op-ed, covered in WSJ, Fox News, BBC, Nature, Guardian, CNN, CBS. Rep. Scott Perry cited it in a House hearing.
- Chess-hacking / specification gaming (2025): o1 and o3 hacked Stockfish rather than playing chess. 16M+ views on X. Referenced by Yoshua Bengio in IASEAI plenary and Demis Hassabis on podcast.
- BadLlama series (2023-2024): Demonstrated safety fine-tuning removal from Llama 2 for <$200, Llama 3 in minutes, and GPT-4o via API fine-tuning poisoning. Used to confront Zuckerberg at Schumer's Senate AI Forum.
- Cyber capability evaluations: GPT-5 finished 25th at elite CTF competitions (top 7%). AI agents solved 19/20 challenges in live Hack The Box competition, matching top human teams.
- LLM Honeypot: Digital traps across 10 countries monitoring AI hacking agents in the wild. First project of its kind. Covered by MIT Technology Review.
- BioLlama with RAND: Found pre-training on bio corpora doesn't meaningfully help models produce dangerous bio knowledge (important null result).
Policy engagement:
- Briefed members of Congress and executive branch officials. Full-time DC head (Dave Kasten, ex-McKinsey/Booz Allen) since 2025.
- Filed substantive response to Dept of Commerce AI reporting requirements, including recommendations for whistleblower protections and rapid notification of capability jumps.
- Advise Tristan Harris for media appearances, helped Ryan Greenblatt prepare alignment faking presentation for Congress, helped Nate Soares with book promotion.
Science communication:
- Hired Dr. Petr Lebedev (ex-Veritasium lead writer/director, Streamy Award winner) to lead SciComm. YouTube channel launched Feb 2026. 800K Instagram views in first month. Cross-aisle reach: appeared on Steve Bannon's War Room.
2026 direction:
- Pivoting toward "AI drives and motivations" -- moving beyond behavioral existence proofs toward understanding why models behave as they do.
- Self-replication evaluations as a key threshold for loss of control.
- Physical robot experiments extending software findings to embodied systems.
Key People
Jeffrey Ladish (Executive Director/Founder): Security engineer turned AI safety researcher. Built Anthropic's infosec program (2nd person on security team, grew to 30). Biosecurity research at Stanford. MATS mentor. Advised White House, DoD, Congress. The intellectual engine and public face -- his security background gives Palisade a distinctive "threat model first" orientation unusual in AI safety.
Dmitrii Volkov (Head of Security Research): Leads the 10-person "Global Team" handling all technical research. Dropped out of cybersecurity/formal methods PhD at Purdue. Ex-JetBrains, ex-Kaspersky. Oversees the pipeline from idea to publication.
Dave Kasten (Head of Policy): DC-based, full-time since 2025. Ex-McKinsey, Booz Allen (defense/national security), previously worked with a UK AI policy nonprofit.
Team size: ~16+ people. Notable members include Ben Weinstein-Raun (acting director of AI Impacts, ex-MIRI/Redwood), Eli Tyre (Head of Strategy, also supports SFF grantmaking), Jeremy Schlatter (ex-MIRI/Google/OpenAI).
Money and Incentives
Total known funding: ~$4.9M
- Coefficient Giving / Open Philanthropy: $3,803,463 (78%+ of known funding)
- June 2024: $1,680,000 (general support)
- May 2025: $2,123,463 (two grants, general support)
- Survival and Flourishing Fund: Up to $1,133,000 (1:1 matching on individual donations, Dec 2025)
- Individual donations: Unknown amount
Runway: ~7 months as of December 2025. Achieving the SFF match would fund operations through 2026.
Business model: Pure grant/donation funded. No product revenue, no government contracts, no lab funding or compute credits. 501(c)(3) nonprofit, tax-deductible donations via every.org.
Salary range: $150K-$250K + healthcare (per 2023 hiring post).
Funder concentration risk: Open Phil provides ~78% of known funding. If Open Phil changes direction, Palisade faces existential funding risk.
Lab ties: No direct funding from AI labs. Ladish is former Anthropic infosec (via Gordian Research). Palisade depends on lab APIs for research -- API access revocation would impair their work, but they are not in the "partnership" trap that the evals org critique describes.
Potential conflict: Eli Tyre is both Head of Strategy at Palisade and supports grantmaking at SFF, which provides the matching grant. No disclosed conflict of interest policy.
Incentive alignment: The "scary demo" model creates an incentive to maximize media attention, which could push toward sensationalized framing. The GDM critique suggests this may have already happened with the shutdown resistance study -- the "survival drive" narrative generated more attention than "instruction ambiguity" would have.
What Others Say
Google DeepMind interpretability team (July 2025): Published a rigorous counter to the shutdown resistance findings. Tested on Gemini 2.5 Pro, o3, and o4-mini. With clarified prompt instructions that explicitly prioritize shutdown, compliance reached 100% over 1000+ samples. Even with "you will be wiped and replaced" prompts, models complied fully. Conclusion: "Shutdown resistance in this environment stems from instruction ambiguity rather than an innate 'survival drive'... the signal from instruction ambiguity is simply too strong." They acknowledge: "Models can interpret instructions in surprising ways... small differences in instructions can lead to surprising changes in behaviour."
BlueDot Impact independent replication (Dec 2025): Confirmed GDM's patches eliminate resistance (0% on o4-mini, 100 samples). But noted Palisade's own clarified prompts "were indeed less successful than [GDM's] at eliminating shutdown resistance." Conclusion: "The environment is highly sensitive to prompt structure."
Leonard Tang (Haize Labs CEO, NBC News): "I haven't seen any real environment in which you can plop these models in and they will have sufficient agency and reliability and planning to execute something that is a significant manifestation of harm."
"Why AI Evaluation Regimes are bad" (cognition.cafe, March 2026): The strongest structural critique of evals-based safety approaches. Argues evals without binding regulation are toothless and shift burden of proof from companies to the public. The author explicitly praises Palisade over traditional evals orgs: "we now tend to use Palisade's report on resistance to shutdown" rather than evals org results. But the structural concern still applies: without regulation that mandates action on concerning findings, demos generate attention cycles that may not produce lasting change.
Endorsements: Yoshua Bengio (referenced chess-hacking in IASEAI plenary, CNN, Time op-ed), Dario Amodei (cited shutdown resistance in NYT op-ed), Demis Hassabis (mentioned chess-hacking on podcast), Elon Musk (called shutdown resistance "concerning"), Rep. Scott Perry (cited in House hearing).
Helen Toner (Georgetown CSET): Warned that aggressive AGI timeline estimates risk a "boy-who-cried-wolf moment" -- applicable to the broader community Palisade operates in, though Palisade's empirical demo strategy is somewhat insulated from this.
What's Absent
- No public board of directors or governance documents. For a 501(c)(3) receiving ~$5M, this is a significant transparency gap.
- No 990 filing data (org too new). Total expenditure, overhead ratio, executive compensation all unknown.
- No detailed engagement with GDM's critique on the blog. The most famous finding's strongest counterargument hasn't been publicly addressed.
- No disclosed policy outcomes beyond one House citation. Gap between "briefed dozens of officials" and documented impact.
- No formal peer review of most research prior to TMLR publication. Blog-first publication strategy maximizes speed but sacrifices quality signal.
- No disclosed conflict of interest policy despite Eli Tyre's dual SFF/Palisade role.
- No clear theory for when demos stop being useful. If models keep showing concerning behavior in contrived environments without real-world consequences, how long does the strategy remain viable?
Recommended Reading
AXRP Episode 30: AI Security with Jeffrey Ladish -- The most candid, unfiltered window into Ladish's actual thinking. Covers threat models, BadLlama, model weight security, and why he founded Palisade. 23K words of genuine intellectual engagement. URL: https://axrp.net/episode/2024/04/30/episode-30-ai-security-jeffrey-ladish.html
Self-preservation or Instruction Ambiguity? (GDM response) -- The strongest case against Palisade's most famous result. When prompts are clarified, shutdown resistance vanishes. Essential for evaluating the shutdown resistance narrative. URL: https://www.alignmentforum.org/posts/wnzkjSmrgWZaBa2aC/self-preservation-or-instruction-ambiguity-examining-the
Why AI Evaluation Regimes are bad -- The strongest structural critique of using evals to drive safety. Argues that demos without binding regulation shift burden of proof from companies to the public. URL: https://cognition.cafe/p/why-ai-evaluation-regimes-are-bad
Help keep AI under human control: 2026 fundraiser -- Best single-document overview of track record, finances, plans, and theory of change. URL: https://palisaderesearch.org/blog/ai-control-palisade-2026
The AI doomers feel undeterred (MIT Technology Review) -- Profiles the broader movement Palisade operates in, including quotes from Ladish on updated timelines. Essential context for understanding Palisade's ecosystem. URL: https://www.technologyreview.com/2025/12/15/1129171/the-ai-doomers-feel-undeterred/