Meaning Alignment Institute

Research

Values alignment through meaning. Philosophical.

Founded: 2023
HQ: Berlin, Germany / San Francisco, CA
Team: 3
Structure: fiscally sponsored
Model: Grants

Theory of Change

MAI's theory of change begins with a philosophical claim: current AI alignment approaches fail because they confuse values with preferences, slogans, and rules. MAI defines values as "constitutive attentional policies" -- what people pay attention to when making choices that are part of living well, not merely instrumental. The causal chain:

  1. Develop rigorous definitions of what human values actually are (drawing on Charles Taylor, Ruth Chang, Amartya Sen)
  2. Build democratic processes that elicit these values from diverse populations and aggregate them into a "moral graph" -- a directed graph where edges represent broad agreement that one value is wiser than another in context
  3. Use this moral graph to align AI systems (replacing RLHF/Constitutional AI), economic mechanisms, and democratic institutions
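
The moral graph described in step 2 can be pictured as a small data structure. What follows is a minimal sketch in Python, not MAI's implementation. The `Vote` schema, the `build_moral_graph` function, the `min_votes` and `min_agreement` thresholds, and the example value names are all hypothetical, chosen only to illustrate the idea that an edge exists where participants broadly agree that one value is wiser than another in a given context.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One participant's judgment in one context (hypothetical schema)."""
    context: str
    from_value: str
    to_value: str
    wiser: bool  # True if the participant judged to_value wiser than from_value


def build_moral_graph(votes, min_votes=3, min_agreement=0.7):
    """Return {context: {(from_value, to_value): agreement}}.

    An edge is kept only when enough participants voted on it and a large
    enough share of them agreed. Both thresholds are illustrative assumptions.
    """
    tallies = defaultdict(lambda: [0, 0])  # key -> [yes_count, total_count]
    for v in votes:
        key = (v.context, v.from_value, v.to_value)
        tallies[key][0] += int(v.wiser)
        tallies[key][1] += 1

    graph = defaultdict(dict)
    for (context, src, dst), (yes, total) in tallies.items():
        agreement = yes / total
        if total >= min_votes and agreement >= min_agreement:
            graph[context][(src, dst)] = agreement
    return dict(graph)


# Hypothetical votes for a single "parenting" context.
votes = [
    Vote("parenting", "set firm rules", "foster curiosity", True),
    Vote("parenting", "set firm rules", "foster curiosity", True),
    Vote("parenting", "set firm rules", "foster curiosity", True),
    Vote("parenting", "set firm rules", "foster curiosity", False),
    Vote("parenting", "foster curiosity", "set firm rules", True),
]
graph = build_moral_graph(votes)
print(graph)
```

In this toy input, the edge from "set firm rules" to "foster curiosity" survives (three of four voters agree), while the reverse edge is dropped because only one person voted on it. Keying edges by context reflects the source's point that "wiser" is judged in context rather than globally.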

In their own words: "We're a research organization. We're not just working on AI alignment, we're working on re-envisioning the whole stack around meaning, including AI but also including institutions like democratic institutions and markets eventually." (Oliver Klingefjord, 2025)

The Full-Stack Alignment paper (Dec 2025) frames this as a third paradigm -- "thick models of value" (TMV) -- beyond preferentist models of value (PMV, i.e. utility functions) and values-as-text (VAT, i.e. constitutional AI principles).

What They Do

Core experiment: Democratic Fine-Tuning with 500 Americans (2023), funded by OpenAI's $100K democratic inputs grant. Participants used a chatbot to articulate values for contentious scenarios (abortion, parenting, weapons). Key results: 97% could articulate a value in the required format, 89% endorsed the moral graph as fair even when their value wasn't the winner, and Republicans and Democrats converged on shared "wiser" values.

Model prototype: WiseLlama-8B, an open-source LLaMA model fine-tuned to display "model integrity" -- acting from coherent, inspectable values rather than rules or vague traits. Small user study (n=78) showed participants found it less "preachy" than default models and could predict its behavior from its values.

Papers: "What are human values, and how do we align AI to them?" (arXiv 2024) and "Full-Stack Alignment" (arXiv 2025, 30+ co-authors from MIT, Oxford, CMU, Stanford, Harvard). Neither is peer-reviewed.

Expanding scope: Market intermediaries (2025-2026) -- AI-enabled economic mechanisms that pay suppliers based on measured human flourishing rather than engagement metrics. Planning a 200-person pilot.

Zero adoption: No AI lab, platform, or institution has deployed any MAI technology. Klingefjord confirmed in 2025: "No, I don't think there is any AI model or service that people know is trained using this approach." The OpenAI grant work was never integrated into any product due to OpenAI's internal turmoil in late 2023.

Key People

Joe Edelman -- Founder & Director. Philosopher-technologist who co-founded Center for Humane Technology with Tristan Harris and coined "Time Well Spent." Background in HCI and programming language design. Has been working on "what to optimize instead of engagement" since 2013. Not an ML researcher -- the intellectual framework is philosophical.

Ryan Lowe -- Research Ecosystem Lead. Co-creator of InstructGPT/RLHF at OpenAI, co-lead for GPT-4 alignment. Left OpenAI ~March 2024 to join MAI. 49K+ Google Scholar citations. His departure from OpenAI to join a 3-person nonprofit is arguably the strongest signal of belief in MAI's approach.

Oliver Klingefjord -- Co-Founder, Technical Lead. Swedish software entrepreneur (founded Potential.app, acquired 2022). Mathematics background. First author on the human values paper.

The team is 3-3.5 FTE, supplemented by a 13-member research network of academics from major institutions and a 6-member advisory board including Aviv Ovadya (AI & Democracy Foundation), Brian Christian (author of "The Alignment Problem"), and Liv Boeree.

Money and Incentives

  • Total confirmed funding: ~$265K (OpenAI $100K, SFF $165K)
  • Additional unconfirmed: ARIA (UK government, amount unknown), Manifund projects (amounts unclear)
  • Total budget: Unknown -- fiscal sponsorship via Hack Foundation means no public 990 filings
  • Business model: Pure grants and donations, no product revenue
  • No Open Philanthropy / Coefficient Giving funding -- notable absence for an AI safety org of this age
  • No compute dependencies or lab funding ties -- MAI is financially independent of frontier AI labs
  • No structural conflicts of interest identified -- the OpenAI grant was a one-time competitive award, not an ongoing relationship

The financial picture is precarious. $265K confirmed over 2.5+ years of operation is very modest. The lean 3-person structure keeps burn rate low but limits execution capacity. The research network model (academics contributing from other institutions) extends reach without direct costs.

What Others Say

Peer-reviewed review of all 10 OpenAI Democratic Inputs teams (Moats & Ganguly, PMC 2025): Notes that as MAI's values become more universal, "they might also become less useful for action." Questions whether the entire Democratic Inputs program was taken seriously by OpenAI: "many of the teams also had the impression that OpenAI was not all that interested in the ultimate results."

Substack commenter (strongest direct critique found): "The appropriation of wisdom as something that trickles up from democracy and massive public processes is an interesting, but potentially dangerous claim... Have massive public processes historically proven to result in wisdom? Our collective is also an adolescent culture that is not trustworthy." Advocates for "wise elders" rather than mass participation.

LessWrong commenter (MiguelDev): "This approach demands a well-informed population capable of setting aside their biases."

Academic critique (Moral Disagreement paper, PMC 2025): Argues crowdsourcing, RLHF, and Constitutional AI all fail to accommodate reasonable moral disagreement. MAI's DFT is a more sophisticated variant but faces the same fundamental challenge.

LW community engagement: Minimal. The technical alignment community appears to have largely not engaged with MAI's work. This silence is itself a finding -- the approach sits at the intersection of philosophy, social choice theory, and alignment, a niche few technical researchers inhabit.

What's Absent

  • No formal benchmarks comparing DFT-aligned models against RLHF or CAI baselines. For an org proposing to replace these methods, this is the critical missing evidence.
  • No replication of the 500-person experiment. Single pilot, Americans only, three scenarios.
  • No adoption by any external entity after 2.5+ years.
  • No engagement with catastrophic risk -- MAI works on "what to align to," not "how to prevent AI from pursuing misaligned objectives." No treatment of deceptive alignment, rapid capability gain, or loss-of-control scenarios.
  • No peer-reviewed publications. Both papers are arXiv preprints.
  • No response from Anthropic to MAI's direct critique of Claude's constitution.
  • No substantive critique from the technical alignment community specifically targeting MAI's approach. That community has neither endorsed nor challenged the moral graph concept.

Recommended Reading

  1. Oliver Klingefjord interview -- Democracy Innovators (2025) -- the most candid explanation of what MAI does, how the moral graph works, and the founding story. Start here for unfiltered understanding. https://democracyinnovators.com/oliver-klingefjord-about-the-meaning-alignment-institute-and-how-to-bring-up-wisdom-in-a-collective/

  2. "Bringing AI participation down to scale" -- Moats & Ganguly (PMC 2025) -- the strongest external assessment, reviewing all 10 OpenAI Democratic Inputs teams. Questions fundamental assumptions. https://pmc.ncbi.nlm.nih.gov/articles/PMC12142630/

  3. "What are human values?" (arXiv 2024) -- the foundational technical paper. https://arxiv.org/abs/2404.10636

  4. "Model Integrity" (Substack, Dec 2024) -- the most concrete product vision. https://meaningalignment.substack.com/p/model-integrity

Claude's Analysis
An opinionated read. Read the brief first to form your own view.

Stated Theory of Change

MAI claims the fundamental obstacle to beneficial AI is conceptual: we don't understand what human values actually are, so we can't align AI to them. Their stated mechanism:

  1. Define values rigorously (not preferences, not slogans, not rules -- constitutive attentional policies grounded in philosophical tradition)
  2. Elicit values democratically at scale using LLM-powered interviews
  3. Aggregate values into a moral graph that surfaces wisdom rather than majority opinion
  4. Use this moral graph to train AI models, redesign economic mechanisms, and upgrade democratic institutions
  5. This produces AI systems with "model integrity" -- coherent, inspectable values that guide behavior in novel situations

The theory rests on a chain of philosophical claims: that values are distinct from preferences, that they can be elicited through careful interviewing, that wisdom can be aggregated through a graph structure, and that this can be operationalized for AI training. Each link in the chain is necessary for the theory to hold.

Revealed Theory of Change

MAI's actions largely track their stated theory, with some notable divergences:

Consistent: The research output -- the values paper, the 500-person experiment, the moral graph implementation, the WiseLlama prototype -- all directly implement the stated theory. The open-source ethos (code + data publicly available) reinforces the stated goal of building public goods.

Divergent -- scope expansion: The trajectory from DFT (2023) to model integrity (2024) to full-stack alignment (2025) to market intermediaries (2025-2026) reveals a pattern of ambition expanding faster than execution capacity. Each new frame is intellectually coherent, but none was validated at scale before the next one launched.

Divergent -- network vs. execution: The Full-Stack Alignment paper with 30+ co-authors looks more like academic coalition-building than technical execution. The gap between "we have a 3-person team" and "we want to redesign markets, democracy, and AI alignment" is extreme. The revealed theory of change may be: "build intellectual consensus among academics, then let that consensus influence how institutions approach these problems" -- which is different from "build and deploy aligned AI systems."

Divergent -- prioritization: MAI appears to prioritize theoretical completeness over empirical validation. They published the full-stack alignment paper before validating the basic DFT approach against standard baselines. The absence of benchmarks is probably not an oversight -- it appears to reflect a genuine belief that getting the conceptual framework right matters more than incremental empirical results.

Key Assumptions

Assumption 1: The "what to align to" question is as important as the "how to align" question.

  • Evidence for: Anthropic's struggles with Claude's constitution, OpenAI's inability to specify values beyond vague helpfulness, the entire Constitutional AI debate
  • Evidence against: If inner alignment / deceptive alignment is the binding constraint, better alignment targets don't help. If capabilities outpace alignment regardless of target quality, the work is moot.
  • Testable: Yes -- compare outcomes of models trained on moral graphs vs. RLHF vs. CAI. MAI has not done this.
  • If wrong: MAI's work becomes a philosophical contribution with no practical impact on AI safety.

Assumption 2: Values are distinct from preferences and can be reliably elicited.

  • Evidence for: The 500-person experiment showed 97% of participants could articulate values in the required format, and most endorsed the result as representing "what they care about."
  • Evidence against: Single study, Americans only, paid Prolific workers. The moral disagreement paper argues these kinds of crowdsourced approaches systematically fail on genuinely controversial questions. The GPT-4-generated stories could bias what gets classified as "wiser."
  • Testable: Yes -- cross-cultural replications, adversarial testing with motivated reasoners, comparison with expert-driven approaches.
  • If wrong: Moral graphs capture something, but not genuinely "deeper" values -- just a different transformation of preferences.

Assumption 3: Wisdom can be aggregated democratically (the moral graph concept).

  • Evidence for: 89% fairness endorsement, expert values surfacing through PageRank.
  • Evidence against: The Substack critic's point -- "Have massive public processes historically proven to result in wisdom?" Small sample, no adversarial participants, no real-world consequences. The graph had only one cycle in 3 scenarios.
  • Testable: Yes -- larger experiments with real stakes, adversarial participants, more divisive scenarios.
  • If wrong: The moral graph is a clever data structure but doesn't actually aggregate toward wisdom.
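
The brief mentions PageRank as the way values surface from the moral graph but gives no implementation details. As a rough illustration, here is a minimal sketch of standard power-iteration PageRank over "wiser than" edges. It is not MAI's code: the `pagerank` function, the damping factor, the iteration count, and the three value names are assumptions made for this example. Edges point from a value to one judged wiser, so rank accumulates on values that many paths lead to.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of (from_value, to_value) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the uniform "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical edges: each pair reads "second value judged wiser than first".
edges = [
    ("follow the rules", "weigh the consequences"),
    ("follow the rules", "understand the person"),
    ("weigh the consequences", "understand the person"),
]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
print(top, ranks)
```

In this toy graph "understand the person" ends up with the highest rank because both other values point toward it. The sketch also shows why the evidence-against bullet matters: the ranking depends entirely on which edges exist, so a handful of adversarial participants adding edges could shift it.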

Assumption 4: The approach scales to institutional transformation.

  • Evidence for: The theoretical framework is coherent across multiple domains.
  • Evidence against: Zero adoption after 2.5 years. Klingefjord himself says they lack capacity to even run a homelessness policy pilot. 3-person team with $265K confirmed funding.
  • Testable: Partially -- the market intermediaries pilot could provide early evidence.
  • If wrong: MAI remains an intellectual contribution that never translates to practice.

Strengths

  1. Genuinely novel intellectual contribution. The distinction between values and preferences, operationalized through constitutive attentional policies and the moral graph, is a real conceptual advance. It draws on serious philosophy (Taylor, Chang, Sen) that most alignment researchers ignore.

  2. Ryan Lowe's involvement. A co-creator of RLHF leaving OpenAI to join a 3-person nonprofit is extremely strong evidence that someone who has seen the limitations of RLHF from the inside believes MAI's approach is worth pursuing.

  3. No conflicts of interest. MAI has no financial ties to frontier labs, no compute dependencies, no board overlaps. This gives their critique of RLHF/CAI unusual credibility.

  4. Promising empirical signals. The 500-person pilot produced surprisingly strong results: bipartisan convergence on contentious topics, high fairness endorsement, expertise surfacing through PageRank. These deserve follow-up.

  5. Academic network. 30+ co-authors on FSA from MIT, Oxford, CMU, etc. demonstrates the intellectual framework is taken seriously by academics across disciplines.

Weaknesses and Risks

  1. No empirical validation against baselines. This is the critical weakness. After 2.5 years, there is no formal comparison showing moral-graph-aligned models are better (by any metric) than RLHF or CAI-aligned models. The claim that DFT is "better" is theoretical.

  2. Scope far exceeds capacity. A 3-person team with ~$265K in confirmed funding is attempting to: redefine human values, build democratic AI alignment processes, redesign economic mechanisms, and reform democratic institutions. Something has to give.

  3. No adoption path. The OpenAI connection dissolved. No lab, company, or government has expressed interest in deploying the approach. The "if you build a good enough framework, they will come" strategy is high-risk.

  4. The technical alignment community isn't paying attention. MAI's LW posts get modest engagement. No prominent alignment researcher has engaged substantively with the moral graph concept. This could mean the approach is seen as philosophically interesting but irrelevant to the safety-critical questions that dominate the field.

  5. GPT-4 dependency in the elicitation process. The entire moral graph construction depends on GPT-4 generating value transition stories and mediating the chat process. This introduces model bias into what's supposed to be a democratic process. MAI acknowledges this limitation but has no solution.

  6. Does not engage with catastrophic risk. MAI's theory of change assumes a world where AI integration is gradual enough for democratic processes to keep pace. If AI capabilities accelerate faster than institutions can adapt, the "full-stack" approach may be too slow.

Cross-References

Complementary to: Anthropic (which could use moral graphs instead of vague constitutional principles), Collective Intelligence Project (overlapping goals, different methods), RadicalxChange (overlapping institutional reform vision).

Tension with: MIRI/machine alignment approaches (MAI focuses on outer alignment targets; MIRI focuses on inner alignment/deception), ARC Evals/METR (evaluations focus on dangerous capabilities, not value alignment quality).

Gap-filling: MAI fills a genuine gap in the alignment ecosystem -- almost no one else is doing rigorous philosophical work on what alignment targets should look like. Most orgs take the alignment target as given (human preferences/feedback) and work on the technical challenge of getting AI to pursue it.

What Would Change This Assessment

Strongly positive updates:

  • A formal evaluation showing moral-graph-aligned models outperform RLHF/CAI models on value alignment metrics, resistance to manipulation, or user trust
  • Adoption by a frontier lab or major platform of any MAI technology
  • Cross-cultural replication of the 500-person experiment with similar results
  • Significant new funding (>$1M) enabling larger experiments and hiring

Strongly negative updates:

  • Cross-cultural replication showing the approach fails outside WEIRD populations
  • Evidence that moral graphs are trivially gameable by motivated actors
  • Key personnel departures (especially Ryan Lowe)
  • Continued absence of adoption after 5+ years of operation

Self-Critique

What sources should I have checked but didn't?

  • The full Jim Rutt Show podcast transcripts from Jan 2023 (pre-MAI founding) and Oct 2025 (FSA discussion) -- I read portions of both but not the full transcripts. These could contain more candid statements about MAI's challenges.
  • The Clearer Thinking podcast with Joe Edelman -- not fully read.
  • Whether any technical alignment researchers have discussed MAI privately (Twitter/X threads, private forums).

Where is this analysis potentially biased?

  • I may be giving too much credit to the philosophical sophistication of the framework. Elegant philosophy doesn't guarantee practical relevance. The alignment community's silence could reflect genuine assessment that the work doesn't address binding constraints, not merely ignorance.
  • I may be overweighting Ryan Lowe's involvement as a validation signal. People leave jobs for many reasons.

What would a thoughtful person who disagrees say? "MAI is doing philosophy of value, not AI safety. The binding constraint on alignment is not 'what to align to' but 'how to prevent deceptive optimization.' Even if MAI perfectly solves the alignment target problem, it doesn't help if we can't ensure AI systems actually pursue the target. Meanwhile, the scope expansion (markets! democracy! economy!) looks like a nonprofit that can't show impact on its core mission, so it keeps reframing the mission to be bigger."

Single weakest claim? That Ryan Lowe's departure from OpenAI validates MAI's approach. It's a strong signal but I may be overweighting it. Lowe may have left OpenAI for personal reasons (burnout, disagreement with leadership) and chosen MAI as a comfortable landing spot rather than a conviction bet.

What information would most change my view? A rigorous, pre-registered evaluation comparing models trained on moral graphs against RLHF/CAI models. If MAI produces better models by standard metrics AND novel metrics, my assessment would shift dramatically positive. If the comparison shows no meaningful difference, my assessment would shift sharply negative.

Connected to (5)

  • ARIA: collaborator
  • AI Objectives Institute: staff from · Oliver Klingefjord
  • Center for Humane Technology: staff from · Joe Edelman
  • Collective Intelligence Project: collaborator
  • OpenAI: collaborator · Ryan Lowe

Sources (54)
Every URL that was read during research.
  1. Meaning Alignment Institute (meaningalignment.org)
  2. We are now "The Institute for Meaning Alignment" (meaningalignment.substack.com)
  3. Introducing Full-Stack Alignment (meaningalignment.substack.com)
  4. OpenAI x DFT: The First Moral Graph (meaningalignment.substack.com)
  5. Model Integrity (meaningalignment.substack.com)
  6. What are human values, and how do we align to them? (meaningalignment.substack.com)
  7. What are human values, and how do we align AI to them? (arxiv.org)
  8. Oliver Klingefjord about the Meaning Alignment Institute and how to bring up wisdom in a collective (democracyinnovators.com)
  9. Joe Edelman (nxhx.org)
  10. ClearerThinking.org Podcast | Aligning society with our deepest values and sources of meaning (with Joe Edelman) (podcast.clearerthinking.org)
  11. Transcript of EP 325 – Joe Edelman on Full-Stack AI Alignment - The Jim Rutt Show (jimruttshow.blubrry.net)
  12. Joe Edelman: Meaning-Aligned AI (thegradientpub.substack.com)
  13. Introducing Democratic Fine-Tuning (meaningalignment.substack.com)
  14. How Prolific helped the Meaning Alignment Institute make AI more democratic | Prolific (prolific.com)
  15. GitHub - meaningalignment/dft: Democratic Fine-tuning with a Moral Graph (github.com)
  16. GitHub - openai/democratic-inputs (github.com)
  17. Safeguarded AI (aria.org.uk)
  18. Model Integrity (greaterwrong.com)
  19. Model Integrity: MAI on Value Alignment (greaterwrong.com)
  20. Model Integrity and Character (greaterwrong.com)
  21. Hi there! I'm Oliver. (klingefjord.com)
  22. elliehain.com (elliehain.com)
  23. Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arxiv.org)
  24. Transcript of Currents 080: Joe Edelman and Ellie Hain on Rebuilding Meaning - The Jim Rutt Show (jimruttshow.blubrry.net)
  25. Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arxiv.org)
  26. Nymark Interviews: The Landmark Research behind "Wise AI" - Nymark (nymark.agency)
  27. Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (full-stack-alignment.ai)
  28. Market Intermediaries: A post-AGI Vision for the Economy (meaningalignment.substack.com)
  29. Looking for testers for a Social App (meaningalignment.substack.com)
  30. SFF-2024 S-Process Recommendations Announcement | Survival and Flourishing Fund (survivalandflourishing.fund)
  31. meaningalignment (Meaning Alignment Institute) (huggingface.co)
  32. Archive - Meaning Alignment Institute (meaningalignment.substack.com)
  33. Meaning Alignment Institute: Year in Review (meaningalignment.substack.com)
  34. David Shapiro Interview (meaningalignment.substack.com)
  35. Model Integrity and Character (meaningalignment.substack.com)
  36. Help us make ChatGPT wiser (meaningalignment.substack.com)
  37. Democratic Fine-Tuning (greaterwrong.com)
  38. Joe Edelman, Values, Preferences, Meaningful Choice - PhilArchive (philarchive.org)
  39. Speakers (radicalxchange.org)
  40. Meet the Creators | Safeguarded AI (aria.org.uk)
  41. Bringing AI participation down to scale (pmc.ncbi.nlm.nih.gov)
  42. Meaning Alignment Institute (github.com)
  43. GitHub - meaningalignment/moral-graph-elicitation (github.com)
  44. GitHub - meaningalignment/wise-ai (github.com)
  45. Get Started (tools.meaningalignment.org)
  46. Full-Stack Alignment (simons.berkeley.edu)
  47. Moral disagreement and the limits of AI value alignment: a dual challenge of epistemic justification and political legitimacy (pmc.ncbi.nlm.nih.gov)
  48. Ryan Lowe (cs.mcgill.ca)
  49. Model Integrity and Character (meaningalignment.substack.com)
  50. Joe Edelman (scholar.google.com)
  51. Oliver Klingefjord (scholar.google.com)
  52. Introducing Full-Stack Alignment (meaningalignment.substack.com)
  53. MiguelDev comments on Democratic Fine-Tuning (greaterwrong.com)
  54. Introducing Democratic Fine-Tuning (meaningalignment.substack.com)