Theory of Change
Cadenza Labs states: "The goal of our group is to do research which contributes to AI Safety. Our main goal is to research, benchmark and develop robust lie detectors for LLMs."
The causal chain, reconstructed from their papers and founding documents:
- Current lie detection methods (probes, LLM judges, behavioral classifiers) fail to generalize across different types of lies.
- Without robust lie detectors, we cannot catch deceptive AI systems -- and "every story of alignment that is actually scary involves AI systems deceiving humans" (Collin Burns, whose DLK research Cadenza builds on).
- Building comprehensive benchmarks (Liars' Bench) that expose failure modes, organizing competitions to advance the field, and developing improved probing methods (Cluster-Norm) make progress toward robust detection possible.
- Eventually, reliable lie detectors would enable deployment monitoring, sandbagging detection in evaluations, and alignment auditing of AI systems.
The underlying theoretical vision comes from Kaarel Hänni's 2023 framework: find concepts in neural networks by searching for features that satisfy structural relations (Ramsification). DLK/CCS is a special case in which the concept "truth" is identified by negation coherence: a statement and its negation should receive opposite truth values. Cadenza's current work focuses on the applied/benchmarking side of this broader program.
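To make "negation coherence" concrete, the following is a minimal sketch of the CCS objective as described in the original DLK paper by Burns et al.: a probe's probabilities for a statement and its negation should sum to roughly one (consistency), and the probe should not answer 0.5 for both (confidence). The names `ccs_loss` and `mean_ccs_loss` and the input format are illustrative assumptions, not Cadenza's code.

```python
def ccs_loss(p_pos: float, p_neg: float) -> float:
    """CCS loss for a single contrast pair.

    p_pos: probe probability that the statement is true.
    p_neg: probe probability that its negation is true.
    """
    # Consistency: p(statement) should equal 1 - p(negation).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: rule out the degenerate answer p_pos = p_neg = 0.5.
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


def mean_ccs_loss(pairs):
    """Average the loss over an iterable of (p_pos, p_neg) pairs."""
    pairs = list(pairs)
    return sum(ccs_loss(p, n) for p, n in pairs) / len(pairs)
```

A probe that assigns 1.0 to a statement and 0.0 to its negation incurs zero loss, while one that assigns 0.5 to both is penalized only by the confidence term. In practice the loss is minimized over probe parameters on hidden-state activations, which this sketch leaves out.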
What They Do
Liars' Bench (November 2025): A benchmark of 72,863 labeled examples of lies and honest responses generated by four open-weight models across seven datasets. Organizes lies along two axes: what the model lies about (world-knowledge vs. self-knowledge) and why it lies (encouraged vs. inherent). Key finding: no existing method achieved above-chance performance across all settings. Linear probes scored 0.54 balanced accuracy (barely above the 0.50 chance level). LLM-as-a-Judge was best at 0.73 but still failed on certain lie types. Co-authored with Samuel Marks (Anthropic). Dataset and code released publicly.
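For readers unfamiliar with the metric, here is a minimal sketch of balanced accuracy, assuming the standard definition (the mean of per-class recall). It shows why 0.50 is the chance baseline even on imbalanced data. The function name and the toy labels are invented for illustration and are not taken from Liars' Bench.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the honest class (0) and recall on the lie class (1)."""
    recalls = []
    for label in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == label]
        if not idx:
            raise ValueError(f"no examples with label {label}")
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy data (hypothetical): 90 honest responses, 10 lies.
y_true = [0] * 90 + [1] * 10
# A useless detector that always predicts "honest".
always_honest = [0] * 100

plain_accuracy = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
balanced = balanced_accuracy(y_true, always_honest)
```

On this toy data the always-honest detector reaches a plain accuracy of 0.9 but a balanced accuracy of 0.5, which is why a score of 0.54 indicates almost no detection ability.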
Lie Detection Competition (March-Summer 2026): Jointly organized with Schmidt Sciences and NDIF. Red teams create new lie datasets ($10K-$27K per team). Blue teams build general-purpose detectors. This is Cadenza's largest initiative -- moving from producing research to organizing the field.
Cluster-Norm (EMNLP 2024): Technical improvement to unsupervised probing that clusters and normalizes activations before applying probes. Also presented at ICML 2024 MechInterp workshop.
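The normalization step can be sketched as follows, assuming cluster assignments are already available. This is an illustration of per-cluster standardization in general, not the paper's implementation: the clustering algorithm, the exact normalization, and the function name `cluster_normalize` are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cluster_normalize(activations, cluster_ids, eps=1e-8):
    """Standardize each activation dimension within its cluster.

    activations: list of equal-length lists of floats.
    cluster_ids: one cluster label per activation (assumed precomputed,
                 for example by k-means; clustering is not shown here).
    """
    # Group row indices by cluster label.
    members = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        members[c].append(i)

    normalized = [None] * len(activations)
    for idx in members.values():
        # Transpose the cluster's rows into per-dimension columns.
        columns = list(zip(*(activations[i] for i in idx)))
        mus = [mean(col) for col in columns]
        sigmas = [pstdev(col) for col in columns]
        for i in idx:
            normalized[i] = [
                (x - mu) / (sigma + eps)
                for x, mu, sigma in zip(activations[i], mus, sigmas)
            ]
    return normalized
```

The intent of a step like this is that large between-cluster offsets are removed before the probe is fit, so the probe is less likely to latch onto whatever feature separates the clusters instead of the feature of interest.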
AI-AI Bias (PNAS 2025): Finding that LLMs prefer AI-generated content over human-authored content. Done in collaboration with ACS Research Group (Prague). Tangential to core mission but published in a high-impact journal.
Open-source tools: Walter Laurito is a main contributor to EleutherAI's ELK project and CCS-Lib (published in JOSS). 17 GitHub repositories including forks of sleeper-agents and MASK benchmark.
Key People
Walter Laurito -- Team lead. Only remaining original founding member. Doctoral researcher at FZI/KIT (splits time with Cadenza). MATS Winter 2023 under John Wentworth. Software engineering background. Main contributor to ELK/CCS-Lib. Implemented benchmarks for UK AISI.
Kieron Kretschmar -- Lead author on Liars' Bench. M.Sc. UvA (cum laude). Thesis on truthfulness representations co-authored with Walter. Previously co-founded two startups.
Samuel Marks (external collaborator) -- Cognitive oversight subteam lead at Anthropic. Co-author on Liars' Bench. MATS mentor for lie detection research. Bridges Cadenza to frontier lab work on honesty.
The team originally had four members (Kaarel Hänni, Kay Kozaronek, Walter Laurito, Georgios Kaklamanos) from SERI MATS 3.0. Three of the four have departed or moved to collaborator/advisor status, and the departures are undocumented. Sharan Maiya (Cambridge PhD, MATS under Hubinger) is listed as a researcher but appears to split time across multiple affiliations. Current active headcount is likely 2-3.
Money and Incentives
Total known external cash funding: ~$167K (LTFF, March 2023) + unknown Coefficient Giving amount + unknown Manifund amount. The LTFF grant ended September 2023.
Coefficient Giving support: Acknowledged in the Liars' Bench paper (November 2025) for "supporting Kieron, Sharan and Walter." Not found in the CG public grants database. Amount and mechanism unknown.
In-kind support: Walter's FZI/KIT doctoral position (salary/stipend). Sharan's UKRI CDT fellowship at Cambridge. FAR Labs Berkeley residency (flights + accommodation for 2). NDIF provides compute infrastructure for the competition.
Schmidt Sciences partnership: Cadenza co-organizes the lie detection competition. Whether Cadenza receives organizational funding from Schmidt (beyond the competition budget for red/blue teams) is unclear. Schmidt also has a separate $300K-$1M Interpretability RFP (due May 2026) that Cadenza could pursue.
Business model: Pure grants. No product revenue, no contracts, no consulting. Team members subsidize participation through academic positions. For context, Apollo Research founder Marius Hobbhahn wrote in 2023 that Kaarel's DLK work was "worth $500K+ for a year." Walter Laurito commented in agreement about funding needs. The gap between perceived research value and actual funding has been persistent.
No formal legal entity found. No incorporation records, EIN, 990, or evidence of fiscal sponsorship. Grant disbursement mechanism is unknown.
Incentive analysis: Cadenza has no commercial incentives, no lab ties that create conflicts, and no product that could be co-opted for capabilities. Their incentive structure is simple: produce research and benchmarks that funders value enough to continue supporting. The risk is not misalignment of incentives but extinction of the org due to underfunding.
What Others Say
The case against the approach (not the org):
Levinstein & Herrmann (2023), "Still No Lie Detector for Language Models": Supervised probes learn spurious correlations, not truth. Probes trained on positive statements performed worse than chance on negated versions. CCS probes find features that "correlate well with truth on the training sets but do not correlate with truth in even mildly more general contexts." This directly challenges the foundations Cadenza builds on.
Berger (2026), "Probing the Limits of the Lie Detector Approach": Lie detectors miss deception-without-lying. LLMs can strategically produce misleading non-falsities that evade truth probes. "If mechanistic approaches equate deception detection with lie detection, they systematically ignore and remain prone to a large and potent class of LLM deception." Cadenza's assertion-based definition explicitly excludes this threat.
"Building Better Deception Probes" (Feb 2026): 70.6% of probe performance variance comes from the instruction pair used during training. "Organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector." This complicates the competition's premise of building general-purpose detectors.
The case for the approach:
Collin Burns (DLK originator): "If the model is totally honest all the time, then surely you can just ask it, 'Are you going to do something dangerous or not?' And it's like, 'Yes, I am.' Then I think we're in okay shape." He argued for unsupervised methods because "human feedback will break down once you get to superhuman models."
Samuel Marks's team at Anthropic: Their recent paper on honesty interventions explicitly adopts Cadenza's Liars' Bench taxonomy and framework. This is the clearest signal that frontier lab researchers find Cadenza's work useful.
Marius Hobbhahn (Apollo Research): Explicitly named Kaarel's DLK work as worth "$500K+ for a year to develop it further."
No external criticism of Cadenza Labs as an organization was found. Criticism targets the broader research approach.
What's Absent
- No candid interviews, podcasts, or long-form discussions by any team member. This is the most significant gap for understanding the team's actual thinking.
- No documented explanation for three of four founding members departing.
- No formal legal entity despite handling grants and organizing a competition.
- No transparent budget or financial reporting.
- No explicit theory-of-change document.
- No public response to the "deception without lying" critique (Berger 2026). The competition still uses the assertion-based definition.
- Manifund funding details remain unknown (HTTP 429 rate-limit errors prevented access).
Recommended Reading
"Still No Lie Detector for Language Models" (Levinstein & Herrmann, 2023) -- The strongest challenge to the probing foundations. Essential for understanding the intellectual terrain. https://arxiv.org/abs/2307.00175
"Searching for a model's concepts by their shape" (Kaarel Hänni, LW, Feb 2023) -- The founding theoretical vision, broader than current Cadenza work. Shows what motivated the research program. https://www.lesswrong.com/posts/Go5ELsHAyw7QrArQ6/
Liars' Bench paper (Kretschmar et al., 2025) -- Cadenza's flagship output. The taxonomy and failure-mode results are the core contribution. https://arxiv.org/abs/2511.16035
"There should be more AI safety orgs" (Hobbhahn, LW, 2023) -- Context for ecosystem position. Contains explicit praise of this work and Walter's comment about funding needs. https://www.lesswrong.com/posts/MhudbfBNQcMxBBvj8/