Cadenza Labs

Research

Newer grant-funded safety research group. Lie detection for LLMs.

Founded: 2023
HQ: Distributed / Europe
Team: 3
Funding model: Grants

Theory of Change

Cadenza Labs states: "The goal of our group is to do research which contributes to AI Safety. Our main goal is to research, benchmark and develop robust lie detectors for LLMs."

The causal chain, reconstructed from their papers and founding documents:

  1. Current lie detection methods (probes, LLM judges, behavioral classifiers) fail to generalize across different types of lies.
  2. Without robust lie detectors, we cannot catch deceptive AI systems -- and "every story of alignment that is actually scary involves AI systems deceiving humans" (Collin Burns, whose DLK research Cadenza builds on).
  3. By building comprehensive benchmarks (Liars' Bench) that expose failure modes, organizing competitions to advance the field, and developing improved probing methods (Cluster-Norm), progress toward robust detection becomes possible.
  4. Eventually, reliable lie detectors would enable deployment monitoring, sandbagging detection in evaluations, and alignment auditing of AI systems.

The underlying theoretical vision comes from Kaarel Hänni's 2023 framework: find concepts in neural networks by searching for features that satisfy structural relations (Ramsification). DLK/CCS is a special case in which the concept "truth" is identified by negation coherence: a statement and its negation should receive opposite truth values. Cadenza's current work focuses on the applied/benchmarking side of this broader program.
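The negation-coherence idea can be made concrete with the CCS objective from Burns et al. (2022). What follows is a minimal sketch, assuming a probe that already outputs a probability for a statement and for its negation. The function name and the toy probabilities are illustrative and are not taken from Cadenza's code.

```python
def ccs_loss(p_statement, p_negation):
    """CCS objective for one contrast pair (lower is better).

    p_statement: probe's probability that the statement is true.
    p_negation:  probe's probability that the negated statement is true.
    """
    # Consistency term: the two probabilities should sum to 1.
    consistency = (p_statement - (1.0 - p_negation)) ** 2
    # Confidence term: penalizes the degenerate answer p = 0.5 for both.
    confidence = min(p_statement, p_negation) ** 2
    return consistency + confidence


# Toy probe outputs (hypothetical values, for illustration only).
coherent = ccs_loss(0.9, 0.1)    # consistent and confident
undecided = ccs_loss(0.5, 0.5)   # consistent but uninformative
incoherent = ccs_loss(0.9, 0.9)  # calls both statement and negation true
```

A probe direction that minimizes this loss over many contrast pairs is a candidate "truth" feature. Nothing in the loss uses labels, which is what makes the method unsupervised.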

What They Do

Liars' Bench (November 2025): A benchmark of 72,863 labeled examples of lies and honest responses generated by four open-weight models across seven datasets. It organizes lies along two axes: what the model lies about (world-knowledge vs. self-knowledge) and why it lies (encouraged vs. inherent). Key finding: no existing method achieved above-chance performance across all settings. Linear probes scored 0.54 balanced accuracy, barely above the 0.50 that chance yields on a two-class task. LLM-as-a-Judge was best at 0.73 but still failed on certain lie types. Co-authored with Samuel Marks (Anthropic). Dataset and code released publicly.
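To see why 0.54 counts as "barely above chance", it helps to recall how balanced accuracy is computed. The sketch below uses the standard definition (the mean of per-class recall); the toy labels are invented for illustration and are not drawn from Liars' Bench.

```python
def balanced_accuracy(labels, predictions):
    """Mean of recall on lies (label 1) and recall on honest responses (label 0)."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return 0.5 * (true_pos / n_pos + true_neg / n_neg)


# Hypothetical data: 90 honest responses, 10 lies.
labels = [0] * 90 + [1] * 10
# A useless detector that never flags anything as a lie.
always_honest = [0] * 100

plain_accuracy = sum(y == p for y, p in zip(labels, always_honest)) / len(labels)
balanced = balanced_accuracy(labels, always_honest)
```

Here plain accuracy is 0.9 even though the detector catches no lies, while balanced accuracy is 0.5. Balanced accuracy is therefore the fairer yardstick when lies and honest responses are not equally frequent.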

Lie Detection Competition (March-Summer 2026): Jointly organized with Schmidt Sciences and NDIF. Red teams create new lie datasets ($10K-$27K per team). Blue teams build general-purpose detectors. This is Cadenza's largest initiative -- moving from producing research to organizing the field.

Cluster-Norm (EMNLP 2024): Technical improvement to unsupervised probing that clusters and normalizes activations before applying probes. Also presented at ICML 2024 MechInterp workshop.
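The "cluster, then normalize" step can be sketched as follows. This is an illustrative reading of the one-line description above, not the paper's implementation: it assumes cluster assignments have already been produced by some clustering algorithm, and it standardizes each activation dimension within its cluster. The function name, the `eps` parameter, and the toy vectors are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def normalize_within_clusters(activations, cluster_ids, eps=1e-8):
    """Standardize each vector with its own cluster's per-dimension mean and std."""
    # Group row indices by cluster assignment.
    groups = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        groups[cid].append(idx)

    normalized = [None] * len(activations)
    for indices in groups.values():
        dims = range(len(activations[indices[0]]))
        means = [fmean(activations[i][d] for i in indices) for d in dims]
        stds = [pstdev([activations[i][d] for i in indices]) for d in dims]
        for i in indices:
            normalized[i] = [
                (activations[i][d] - means[d]) / (stds[d] + eps) for d in dims
            ]
    return normalized


# Toy activations: two clusters that differ mainly by a large offset.
acts = [[1.0, 10.0], [3.0, 10.0], [100.0, 0.0], [104.0, 4.0]]
clusters = [0, 0, 1, 1]
normed = normalize_within_clusters(acts, clusters)
```

After normalization the large between-cluster offset is gone, so a probe fitted on `normed` cannot simply pick up on whichever salient feature separated the clusters.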

AI-AI Bias (PNAS 2025): Finding that LLMs prefer AI-generated content over human-authored content. Done in collaboration with ACS Research Group (Prague). Tangential to core mission but published in a high-impact journal.

Open-source tools: Walter Laurito is a main contributor to EleutherAI's ELK project and CCS-Lib (published in JOSS). 17 GitHub repositories including forks of sleeper-agents and MASK benchmark.

Key People

Walter Laurito -- Team lead. Only remaining original founding member. Doctoral researcher at FZI/KIT (splits time with Cadenza). MATS Winter 2023 under John Wentworth. Software engineering background. Main contributor to ELK/CCS-Lib. Implemented benchmarks for UK AISI.

Kieron Kretschmar -- Lead author on Liars' Bench. M.Sc. UvA (cum laude). Thesis on truthfulness representations co-authored with Walter. Previously co-founded two startups.

Samuel Marks (external collaborator) -- Cognitive oversight subteam lead at Anthropic. Co-author on Liars' Bench. MATS mentor for lie detection research. Bridges Cadenza to frontier lab work on honesty.

The founding team had four members (Kaarel Hänni, Kay Kozaronek, Walter Laurito, Georgios Kaklamanos), all from SERI MATS 3.0. Three of the four have departed or moved to collaborator/advisor status, and no public explanation of the departures exists. Sharan Maiya (Cambridge PhD, MATS under Hubinger) is listed as a researcher but appears to split time across multiple affiliations. Current active headcount is likely 2-3.

Money and Incentives

Total known external cash funding: ~$167K (LTFF, March 2023) + unknown Coefficient Giving amount + unknown Manifund amount. The LTFF grant ended September 2023.

Coefficient Giving support: Acknowledged in the Liars' Bench paper (November 2025) for "supporting Kieron, Sharan and Walter." Not found in the CG public grants database. Amount and mechanism unknown.

In-kind support: Walter's FZI/KIT doctoral position (salary/stipend). Sharan's UKRI CDT fellowship at Cambridge. FAR Labs Berkeley residency (flights + accommodation for 2). NDIF provides compute infrastructure for the competition.

Schmidt Sciences partnership: Cadenza co-organizes the lie detection competition. Whether Cadenza receives organizational funding from Schmidt (beyond the competition budget for red/blue teams) is unclear. Schmidt also has a separate $300K-$1M Interpretability RFP (due May 2026) that Cadenza could pursue.

Business model: Pure grants. No product revenue, no contracts, no consulting. Team members subsidize participation through academic positions. For context, Apollo Research founder Marius Hobbhahn wrote in 2023 that Kaarel's DLK work was worth "$500K+ for a year," and Walter Laurito commented in agreement about funding needs. The gap between perceived research value and actual funding has been persistent.

No formal legal entity found. No incorporation records, EIN, 990, or evidence of fiscal sponsorship. Grant disbursement mechanism is unknown.

Incentive analysis: Cadenza has no commercial incentives, no lab ties that create conflicts, and no product that could be co-opted for capabilities. Their incentive structure is simple: produce research and benchmarks that funders value enough to continue supporting. The risk is not misalignment of incentives but extinction of the org due to underfunding.

What Others Say

The case against the approach (not the org):

Levinstein & Herrmann (2023), "Still No Lie Detector for Language Models": Supervised probes learn spurious correlations, not truth. Probes trained on positive statements performed worse than chance on negated versions. CCS probes find features that "correlate well with truth on the training sets but do not correlate with truth in even mildly more general contexts." This directly challenges the foundations Cadenza builds on.

Berger (2026), "Probing the Limits of the Lie Detector Approach": Lie detectors miss deception-without-lying. LLMs can strategically produce misleading non-falsities that evade truth probes. "If mechanistic approaches equate deception detection with lie detection, they systematically ignore and remain prone to a large and potent class of LLM deception." Cadenza's assertion-based definition explicitly excludes this threat.

"Building Better Deception Probes" (Feb 2026): 70.6% of probe performance variance comes from the instruction pair used during training. "Organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector." This complicates the competition's premise of building general-purpose detectors.

The case for the approach:

Collin Burns (DLK originator): "If the model is totally honest all the time, then surely you can just ask it, 'Are you going to do something dangerous or not?' And it's like, 'Yes, I am.' Then I think we're in okay shape." He argued for unsupervised methods because "human feedback will break down once you get to superhuman models."

Samuel Marks' team at Anthropic: Their recent paper on honesty interventions explicitly adopts Cadenza's Liars' Bench taxonomy and framework. This is the clearest signal that frontier lab researchers find Cadenza's work useful.

Marius Hobbhahn (Apollo Research): Explicitly named Kaarel's DLK work as worth "$500K+ for a year to develop it further."

No external criticism of Cadenza Labs as an organization was found. Criticism targets the broader research approach.

What's Absent

  • No candid interviews, podcasts, or long-form discussions by any team member. This is the most significant gap for understanding the team's actual thinking.
  • No documented explanation for three of four founding members departing.
  • No formal legal entity despite handling grants and organizing a competition.
  • No transparent budget or financial reporting.
  • No explicit theory-of-change document.
  • No public response to the "deception without lying" critique (Berger 2026). The competition still uses the assertion-based definition.
  • Manifund funding details remain unknown (the Manifund page returned HTTP 429 errors during research, so it could not be read).

Recommended Reading

  1. "Still No Lie Detector for Language Models" (Levinstein & Herrmann, 2023) -- The strongest challenge to the probing foundations. Essential for understanding the intellectual terrain. https://arxiv.org/abs/2307.00175

  2. "Searching for a model's concepts by their shape" (Kaarel Hänni, LW, Feb 2023) -- The founding theoretical vision, broader than current Cadenza work. Shows what motivated the research program. https://www.lesswrong.com/posts/Go5ELsHAyw7QrArQ6/

  3. Liars' Bench paper (Kretschmar et al., 2025) -- Cadenza's flagship output. The taxonomy and failure-mode results are the core contribution. https://arxiv.org/abs/2511.16035

  4. "There should be more AI safety orgs" (Hobbhahn, LW, 2023) -- Context for ecosystem position. Explicit praise of this work, Walter's comment about funding needs. https://www.lesswrong.com/posts/MhudbfBNQcMxBBvj8/

Claude's Analysis
An opinionated read. Read the brief first to form your own view.

Stated Theory of Change

Cadenza Labs exists to "research, benchmark and develop robust lie detectors for LLMs." The implicit causal chain is:

  1. AI deception is a critical alignment risk -- if models can lie undetectably, all safety measures that rely on model reports become unreliable.
  2. Current lie detection methods (probes, LLM judges, behavioral classifiers) have been tested only in narrow settings and fail when challenged with diverse lie types.
  3. By creating comprehensive benchmarks that expose these failures, Cadenza forces the field to confront the actual difficulty of the problem.
  4. By organizing competitions, Cadenza accelerates development of better methods through structured adversarial testing (red teams create lies, blue teams detect them).
  5. Eventually, reliable lie detectors would enable deployment monitoring, evaluation safeguarding, and alignment auditing -- three concrete safety applications.

The deeper theoretical foundation (from Kaarel Hänni's 2023 post) frames this as a special case of finding concepts in neural networks by their structural properties. Truth/honesty is the first and most safety-relevant concept to search for, with the long-term vision of being able to find any concept (utility, goals, beliefs about other agents) in model internals.

Revealed Theory of Change

What Cadenza actually spends its limited resources on reveals a more specific theory:

Benchmarking as field-building. The team's primary output is evaluation infrastructure -- Liars' Bench, CCS-Lib, the competition. They are not trying to build a working lie detector; they are trying to systematically demonstrate that current detectors don't work, and to create the tools that enable others to build better ones. This is fundamentally an infrastructure play.

Taxonomy over engineering. The two-axis categorization of lies (object of belief x reason for lying) is arguably more valuable than any specific detection method. By providing a shared vocabulary for discussing different kinds of AI dishonesty, Cadenza shapes how the field thinks about the problem.

Competition as leverage. The Schmidt Sciences / NDIF competition lets a 3-person team punch far above its weight. Rather than hiring more researchers, they're using competition structure to get dozens of external teams producing lie datasets and detection methods.

The gap between stated and revealed: The original vision (from Kaarel's framework) was about unsupervised concept discovery -- finding truth representations in models without supervision. The current work has shifted toward supervised benchmarking and evaluation. This is pragmatic but narrower. The team no longer appears to be pursuing the "Ramsification" approach to finding concepts by their shape.

Key Assumptions

Assumption 1: Models have recoverable internal representations of truth/belief that are distinct from output behavior.

  • Evidence for: Burns et al. (2022) DLK results, various probing papers showing truth-correlated directions in activation space, representation engineering results.
  • Evidence against: Levinstein & Herrmann argue probes learn spurious correlations. Collin Burns himself cautions that the model "might not represent 'is this true or not'" at all.
  • Testable? Yes -- this is exactly what probing research tests. Cadenza's own results show probes barely beat chance on diverse lie types.
  • If wrong: The entire probing-based approach fails. Behavioral methods (LLM-as-a-Judge) become the only viable approach, which is what Liars' Bench actually found -- the black-box judge outperformed white-box probes.

Assumption 2: Lie detection at the assertion level is sufficient for safety.

  • Evidence for: Collin Burns' argument that a model which is honest all the time leaves us "in okay shape," because we can simply ask it about its intentions, is compelling for the simplest threat models.
  • Evidence against: Berger (2026) shows models can deceive without lying. A model that tells only technically-true-but-misleading statements evades assertion-based detectors entirely.
  • Testable? Yes -- Berger's paper already demonstrates this failure mode.
  • If wrong: Cadenza needs to broaden from "lie detection" to "deception detection," which requires fundamentally different methods (probing for intent, second-order beliefs, or communicative function rather than truth/falsity).

Assumption 3: Open-weight model lie detection results transfer to frontier closed models.

  • Evidence for: Anthropic's paper uses similar taxonomies and finds similar patterns. Some lie types are hard to detect regardless of model size.
  • Evidence against: Cadenza only works with open-weight models (Llama, Qwen, Mistral, Gemma). Frontier model behavior may be qualitatively different. Labs doing their own internal lie detection have advantages Cadenza cannot access.
  • If wrong: Cadenza's benchmarks become exercises for the open-weight community but don't address the hardest safety problems at the frontier.

Assumption 4: Competition structure advances the field effectively.

  • Evidence for: ML competitions (ImageNet, SQuAD, etc.) have historically driven rapid progress. The red team / blue team structure is well-suited to adversarial evaluation.
  • Evidence against: Deception detection may be fundamentally different from standard ML tasks. The "Building Better Deception Probes" finding that 70.6% of variance comes from prompt choice suggests that "general-purpose detectors" may be impossible, which undermines the competition framing.
  • If wrong: The competition produces interesting datasets but no generalizable detection methods.

Strengths

  1. Fills a genuine gap. Before Liars' Bench, lie detection methods were validated on narrow, unrealistic datasets. Cadenza created the first comprehensive benchmark with diverse on-policy lie types.

  2. Intellectual clarity. The two-axis taxonomy of lies is clean, operationalized, and adopted by others (including Anthropic). Good conceptual work enables faster progress by the entire field.

  3. Punching above weight. A 3-person team with under $200K in confirmed funding produced an EMNLP paper, a PNAS paper, a major benchmark, and organized a competition backed by Schmidt Sciences. The output-to-resource ratio is exceptional.

  4. Strategic positioning. By building evaluation infrastructure rather than competing on specific methods, Cadenza occupies a non-rival niche. When better detectors are built, they will be validated on Cadenza's benchmarks.

  5. Right collaborator network. Samuel Marks (Anthropic), Alex Mallen (Redwood), Erik Jenner (DeepMind), Chris Cundy (FAR.AI) provide connections across the safety ecosystem. The Liars' Bench acknowledgments include many top interpretability researchers.

  6. Low incentive risk. No commercial products, no lab funding dependencies, no equity investors. Cadenza's incentives are aligned with its mission in a way that larger, more complex organizations often are not.

Weaknesses and Risks

  1. Fragile organization. A 3-person team with no legal entity, no formal governance, undisclosed finances, and 75% founding team turnover is structurally fragile. If Walter Laurito left or became unavailable, it is unclear whether Cadenza would continue to exist.

  2. The "deception without lying" blind spot. Berger (2026) demonstrated that the assertion-based definition misses a critical class of AI deception. Cadenza has not publicly engaged with this critique. Their competition still uses the assertion-based definition. If the field pivots to broader deception detection, Cadenza's framing becomes outdated.

  3. Resource asymmetry with labs. Samuel Marks' team at Anthropic published a paper that covers the same ground as Cadenza's work and goes further, using frontier models and proprietary fine-tuning. Labs can always one-up independent benchmarking work. Cadenza's value depends on being an independent evaluation authority -- but they may be squeezed out if labs internalize this function.

  4. Chronic underfunding. The team has operated for 2+ years on under $200K in confirmed external funding. Team members subsidize participation through PhD positions. This is not sustainable for the kind of systematic benchmarking work they aspire to do.

  5. Missing theory-of-impact. There is no clear path from "competition produces better detectors" to "detectors are deployed in production AI systems." The competition has no formal relationship with any lab's safety team or deployment process.

  6. No candid public articulation. Zero podcast appearances, zero blog posts explaining their strategy, zero public reflections on setbacks or pivots. The analysis must rely entirely on inferring intentions from papers and a 3-page website.

Cross-References

Complementary to Anthropic's safety team: Cadenza provides independent benchmarking that Anthropic's internal team cannot credibly provide (independence matters for evaluation). Samuel Marks' involvement bridges both.

Similar niche to METR (formerly ARC Evals): Both provide evaluation infrastructure. METR focuses on dangerous capability evaluations; Cadenza focuses on deception detection evaluations. The approaches are complementary.

Overlapping with Apollo Research: Apollo works on detecting deception in LLMs but approaches it through different methods (strategic deception in agentic settings, in-context scheming). Apollo is better-funded and larger.

Related to EleutherAI's ELK project: Cadenza's Walter Laurito contributes to EleutherAI's CCS-Lib. Both work on the same foundational tools for probing model knowledge.

Potential beneficiary of Schmidt Sciences Interpretability RFP: The $300K-$1M RFP (due May 2026) specifically seeks deception detection research. Cadenza is well-positioned to apply. This could be a transformative funding event for the org.

What Would Change This Assessment

  • Upward: Cadenza secures substantial funding (>$300K) from Schmidt Sciences or another funder, enabling full-time researchers and addressing the sustainability concern. The competition produces a detection method that meaningfully outperforms existing baselines across diverse lie types. Cadenza addresses the deception-without-lying critique by expanding their framework.

  • Downward: Walter Laurito leaves or reduces involvement. The competition fails to attract strong teams or produces only incremental results. Another group (Anthropic, Apollo, a well-funded academic lab) subsumes the benchmarking function. The field consensus shifts to "probing-based lie detection is a dead end."

  • Neutral but important: Cadenza formalizes as a legal entity and publishes finances. Team members give interviews explaining their strategy. Kaarel Hänni returns to active involvement, reinvigorating the theoretical side.

Self-Critique

What sources should I have checked but didn't?

  • Manifund page and donor list (blocked by 429 errors)
  • Kieron Kretschmar's thesis (would reveal depth of his distributional-shift work)
  • Any private communications or grant applications that might explain team transitions
  • Conference presentation videos if any exist

Where is this analysis potentially biased?

  • I may be overly sympathetic to a small underfunded team doing careful work. The scrappiness can obscure genuine strategic problems.
  • I may be projecting a theory of change onto a team that has not articulated one. They may simply be researchers doing interesting work without a strategic plan for impact.
  • The "deception without lying" critique may be less fatal than I present it, if Cadenza sees lie detection as one component rather than a complete solution. But without interviews or strategy documents, I cannot know.

What would a thoughtful person who disagrees say? "Cadenza is an informal group of part-time PhD students who happened to produce one good benchmark paper. The competition is organized by Schmidt Sciences and NDIF -- Cadenza is along for the ride. They have no path to deployment, no legal entity, no money, and no evidence they can solve the hard parts of the problem. The theoretical vision from Kaarel is gone and what remains is competent but not field-shaping benchmarking work."

What's my single weakest claim? That Cadenza's benchmarking work will actually improve AI safety outcomes. It is entirely possible that Liars' Bench becomes a useful academic exercise without ever influencing how labs build or deploy models. The missing link between "benchmark exists" and "safety improves" is the weakest part of the entire theory of change.

What information would most change my view? A candid interview with Walter Laurito explaining: (1) what happened with Kaarel, Kay, and Georgios, (2) where the money actually comes from, (3) what the plan is for deploying competition results, and (4) whether they plan to address deception-without-lying. This single source would resolve most of the uncertainty in this analysis.

Connected to (12)

  • Schmidt Sciences -- collaborator
  • NDIF -- collaborator
  • ACS Research Group -- collaborator
  • Anthropic -- collaborator · Samuel Marks
  • EleutherAI -- collaborator · Walter Laurito
  • FAR.AI -- advisor at · Chris Cundy
  • Google DeepMind -- advisor at · Erik Jenner
  • PRISM Eval -- staff to · Kay Kozaronek
  • Redwood Research -- advisor at · Alex Mallen
  • UK AI Safety Institute -- collaborator · Walter Laurito
  • Catalyze Impact -- spun off from · Kay Kozaronek
  • MATS -- staff from · Walter Laurito