Metaculus

Forecasting

Community forecasting platform.

Founded: 2015
HQ: Santa Monica, CA
Team: 28
Structure: PBC
Model: Mixed

Theory of Change

Metaculus's stated theory of change is that aggregating probabilistic predictions from a community of forecasters produces accurate foresight that improves decision-making on globally important topics -- AI timelines, biosecurity, geopolitics, climate. The platform creates "epistemic infrastructure" by defining precise questions, soliciting probability estimates, and aggregating them into community predictions that outperform individual experts.

However, the founder tells a different story. Anthony Aguirre, who started Metaculus at the same time as FLI to serve FLI's planning needs, says it has been of "surprisingly little" use:

"Once you've really carefully defined a question... whether that thing is 70% likely or 80% likely, nobody cares. It barely ever matters... But the process of getting to the point of knowing exactly what X is and having a well-defined question and keeping track of who makes good predictions and who doesn't... those things are really valuable." (AXRP, Feb 2025)

Current CEO Deger Turan (since April 2024) has reframed the theory around "epistemic security" -- using forecasting not just for prediction accuracy but as a tool for building shared world models, conditional policy analysis, and democratic deliberation. He acknowledges: "We already have really accurate forecasts, and we haven't had the Cambrian explosion of every single entity, corporation, government perpetually using forecasts, because the framing of the forecasts has not proven to be helpful just yet."

What They Do

Metaculus operates an online crowd forecasting platform (not a prediction market -- no money changes hands on predictions). Users submit probability estimates on precisely defined, resolvable questions. These are aggregated into a community prediction using recency-weighted medians. The platform has 2.9M+ predictions across 21K+ questions, with 9K+ resolved. All users are scored on calibration and accuracy via logarithmic scoring rules.
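The aggregation step can be illustrated with a small sketch. This is not Metaculus's actual implementation: the exponential `decay` weighting, the function name, and the sample forecasts are all assumptions chosen only to show what a recency-weighted median does.

```python
from typing import Sequence


def recency_weighted_median(probs: Sequence[float], decay: float = 0.9) -> float:
    """Weighted median of forecasts ordered oldest -> newest.

    `decay` is a hypothetical parameter: the newest forecast gets weight 1,
    the one before it `decay`, then `decay**2`, and so on.
    """
    if not probs:
        raise ValueError("need at least one forecast")
    n = len(probs)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    # Sort forecasts by value, carrying each forecast's weight along.
    pairs = sorted(zip(probs, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for p, w in pairs:
        running += w
        # The weighted median is the first value at which the
        # cumulative weight reaches half of the total weight.
        if running >= half:
            return p
    return pairs[-1][0]


# Hypothetical forecasts on one question, oldest first.
forecasts = [0.2, 0.3, 0.8, 0.7, 0.75]
plain = recency_weighted_median(forecasts, decay=1.0)   # ordinary median
recent = recency_weighted_median(forecasts, decay=0.5)  # favors newer forecasts
```

With equal weights the result is the ordinary median of the five forecasts; with a strong decay the aggregate moves toward the most recent submissions.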

Key activities and outputs:

  • AI Forecasting Benchmark Series: Year-long tournament comparing AI bots vs human forecasters. $175K in prizes. Results show AI rapidly catching up -- in Summer 2025, Mantic (AI bot) placed 8th out of 549 forecasters, the first bot in the top 10.
  • FutureEval (Feb 2026): Benchmark projecting AI will pass community forecasters by April 2026 and pro forecasters by mid-2027.
  • Pro Forecaster services: Top 2% of users are hired for paid private forecasting for clients including Bridgewater Associates, CDC, GiveWell, FAS, IST.
  • Policy partnerships: FAS Climate Tipping Points Tournament, IST AI + Nuclear Landscape forecasting, CDC FluSight (3 years running).
  • Open source: Entire codebase released June 2024 under BSD-2-Clause license. A third of contributors are now external.

Track record highlights: predicted Ukraine invasion 2 weeks early at high probability; COVID death forecasts with 12.2% error rate; outperformed FiveThirtyEight and all prediction markets on 2022 midterms (though this is "basically just one data point" per the analyst who measured it).

Key People

Anthony Aguirre -- Founder and Chairman. Theoretical physicist (UC Santa Cruz), co-founder and Executive Director of Future of Life Institute. Started Metaculus and FLI simultaneously around 2015. Now "pretty much full-time" at FLI, making him a part-time Chairman of Metaculus.

Deger Turan -- CEO since April 2024. Background in collective intelligence and NLP. Previously president of AI Objectives Institute (sociotechnical alignment nonprofit). Founded Cerebra Technologies (discourse analysis for 300M+ people). His vision emphasizes "epistemic security," conditional forecasts, and "mini-Metaculuses" (focused forecasting instances).

Gaia Dempsey -- Former CEO, now Special Advisor and board member. Oversaw the major Open Phil grants and PBC conversion. Transitioned to advisory role when Turan joined.

Team size is approximately 28 people, fully remote across 4 continents.

Money and Incentives

Total known funding: ~$13.29M (sum of the grants listed below)

Source                                  Amount        Dates
Open Philanthropy/Coefficient Giving    $11,882,400   2019-2024 (5 grants)
EA Infrastructure Fund                  $308,043      2021
Metaplanet                              $175,000      2024
NSF                                     $150,000      2023
FTX Future Fund                         $20,000       2022
Survival and Flourishing Fund           $750,000      2022-2024

Single-funder concentration: Open Philanthropy (now Coefficient Giving) provides ~89% of the known grant funding listed above ($11.88M of $13.29M). SFF's $750K provides some diversification, but the dependency remains extreme: if Open Phil shifts priorities, Metaculus faces an existential funding crisis. The three largest Open Phil grants ($5.5M, $3M, $2.75M) are all for "Platform Development."
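The concentration figure can be recomputed directly from the funding table. The sketch below takes the listed amounts at face value; the dictionary and the helper name `funder_share` are illustrative, not part of any Metaculus or funder data source.

```python
# Grant amounts in USD, copied from the funding table above.
grants = {
    "Open Philanthropy/Coefficient Giving": 11_882_400,
    "EA Infrastructure Fund": 308_043,
    "Metaplanet": 175_000,
    "NSF": 150_000,
    "FTX Future Fund": 20_000,
    "Survival and Flourishing Fund": 750_000,
}


def funder_share(amounts: dict[str, int], funder: str) -> float:
    """Fraction of the listed total that comes from one funder."""
    return amounts[funder] / sum(amounts.values())


op = "Open Philanthropy/Coefficient Giving"
total = sum(grants.values())
share_all = funder_share(grants, op)

# Same calculation with the SFF row left out, to show how sensitive
# the headline percentage is to which grants are counted.
without_sff = {k: v for k, v in grants.items() if k != "Survival and Flourishing Fund"}
share_without_sff = funder_share(without_sff, op)

print(f"total: ${total:,}")
print(f"Open Phil share (all rows): {share_all:.1%}")
print(f"Open Phil share (excluding SFF): {share_without_sff:.1%}")
```

With these inputs the Open Philanthropy share is about 89.4% when all six rows are counted and about 94.8% when the SFF row is excluded, so the headline percentage depends on which grants are included in the denominator.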

Legal structure: Public Benefit Corporation (Delaware) since September 2022. As a PBC (for-profit), Metaculus has no obligation to publish financial data. No 990 filings exist. No public annual reports.

Revenue model: Estimated $3.9M/year total revenue (third-party estimate). Revenue breakdown between grants and commercial services (Pro Forecaster engagements, tournament hosting) is unknown. Enterprise clients include Bridgewater, CDC, GiveWell, FAS, IST.

Equity structure: Unknown. The PBC structure allows equity investors. No information about whether grants came with equity, or who holds ownership stakes.

Key incentive tensions: (1) Open Phil funds Metaculus as infrastructure for its own decisions, creating a principal-agent dynamic. (2) PBC structure means philanthropic funding could create value that accrues to equity holders rather than the mission. (3) AI lab credit donations for tournaments create soft dependencies on labs whose systems are being benchmarked.

What Others Say

Strongest criticism (scoring system): Ross Rheingans-Yoo showed mathematically that Metaculus's scoring creates "no-information arbitrage" -- users can earn points without adding useful information by predicting the median. The community-relative scoring component "actively incentivizes gaming." The critique was prompted by Zvi Mowshowitz creating an account and immediately quitting. Nuno Sempere confirmed in an arXiv paper that these incentive problems persist.

Strongest criticism (decision-relevance): Sempere estimated the value of Metaculus questions at ~$225K/year. Most questions "fail to directly influence decisions." Many are either too narrow to matter or too large-scope to be influenceable.

Strongest criticism (x-risk forecasting): Normaltech argues existential risk forecasts are "too unreliable to inform policy." Subjective probabilities "vary by orders of magnitude" between experts. There is no way to measure forecaster skill on unique/rare events. Even the best scoring rules require "a hundred million" forecasts to detect systematic overestimation of 1% tail risks. Forecasters are essentially making up numbers.

Strongest criticism (AI predictions): Rethink Priorities found Metaculus AI forecasters were overconfident on numeric questions (2/16 resolved within the 25-75% confidence interval) and slightly biased toward predicting faster progress than occurred.
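To see why 2 of 16 points to overconfidence: a well-calibrated 25-75% interval should contain the resolved value about half the time, so roughly 8 of 16 hits would be expected. The sketch below computes how unlikely 2 or fewer hits would be under perfect calibration. It assumes the 16 questions resolve independently, which is a simplification (AI questions are likely correlated), and the function name is illustrative.

```python
import math


def prob_at_most(hits: int, n: int, p: float) -> float:
    """P(X <= hits) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(hits + 1)
    )


n_questions = 16
coverage = 0.5  # a 25-75% interval should capture the outcome half the time
expected_hits = n_questions * coverage
tail = prob_at_most(2, n_questions, coverage)

print(f"expected hits if calibrated: {expected_hits:.0f}")
print(f"P(2 or fewer hits | calibrated): {tail:.4f}")
```

Under these assumptions the probability is about 0.2%, so the observed result is hard to attribute to chance, although correlation between questions would weaken that conclusion.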

Defense: Metaculus demonstrably outperforms prediction markets and individual experts on many questions. Its COVID and Ukraine forecasts were genuinely useful. The reputation-based system solves the "honest forecasting" problem without financial stakes. And the 2022 midterm performance, while one data point, showed meaningful advantage over FiveThirtyEight and all prediction markets.

What's Absent

  • No public financial statements or annual reports despite $12M+ in philanthropic funding.
  • No documented case of an institution changing a policy or allocation based on a Metaculus forecast (only individual anecdotes).
  • No public response to the Rheingans-Yoo scoring system critique.
  • No information about equity structure or ownership.
  • No independent board members confirmed -- all known board members are insiders.
  • No systematic measurement of forecasting impact (an organization that quantifies everything has never quantified its own impact).
  • Progress on the "mini-Metaculus" vision (central to Turan's strategy) is unknown 1.5+ years after announcement.
  • User demographics and active user counts are unpublished -- fewer than 1,500 people have individually forecast on the most prominent AGI question.

Recommended Reading

  1. Anthony Aguirre on AXRP (Feb 2025) -- Most candid source. Founder admits predictions are of "surprisingly little" use to FLI. The value is in question definition, not the numbers themselves. https://axrp.net/episode/2025/02/09/episode-38_7-anthony-aguirre-future-of-life-institute.html

  2. Ross Rheingans-Yoo: "Metaculus has some issues" (Feb 2021) -- Strongest substantive criticism. Mathematical proof that the scoring system incentivizes gaming rather than information injection. https://blog.rossry.net/metaculus/

  3. Deger Turan on Cognitive Revolution (Aug 2024) -- 2-hour interview with the new CEO. Full vision for "epistemic security" and the strategic pivot toward decision-relevance. https://www.cognitiverevolution.ai/scaling-superforecasting-ai-forecasting-tournaments-road-to-epistemic-security-with-deger-turan/

  4. Normaltech: "AI existential risk probabilities are too unreliable to inform policy" (Jul 2024) -- Fundamental critique of using crowd forecasts for policy on unprecedented events. Directly challenges Metaculus's most visible use case. https://www.normaltech.ai/p/ai-existential-risk-probabilities

  5. Nuno Sempere: "An Estimate of the Value of Metaculus Questions" (Oct 2021) -- Attempts to quantify decision-relevance. The $225K/year estimate vs. $12M+ in funding is illuminating. https://nunosempere.com/blog/2021/10/22/an-estimate-of-the-value-of-metaculus-questions/

Claude's Analysis
An opinionated read. Read the brief first to form your own view.

Stated Theory of Change

Metaculus's stated theory is a three-step causal chain:

  1. Precise, resolvable questions are defined on globally important topics (AI, biosecurity, geopolitics, climate).
  2. A community of calibrated forecasters submits probability estimates, which are aggregated (per the brief, using recency-weighted medians) to produce community predictions more accurate than individual experts.
  3. Decision-makers at institutions, governments, and organizations use these predictions to make better-informed choices, reducing global risks and improving outcomes.

Under CEO Deger Turan, the stated theory has expanded: forecasting is now framed as a building block for "epistemic security" -- using conditional forecasts, world models, and "mini-Metaculus" focused instances to support democratic deliberation, resource allocation, and policy analysis.

Revealed Theory of Change

The actions tell a different story from the stated theory. Several disconnects emerge:

The product that gets funded is not the product that gets used. Open Philanthropy has invested $11.88M primarily in "Platform Development" -- building better forecasting infrastructure. But Aguirre admits the predictions themselves are of "surprisingly little" use, and Sempere estimates only ~$225K/year in decision-relevant value from all Metaculus questions combined. The gap between investment and measured impact is enormous.

The most visible output serves a signaling function, not a decision function. Metaculus's AGI timeline questions are its most cited product -- referenced in The Economist, Bloomberg, Fortune, and by researchers like Toby Ord. But they function as "here's what the community thinks" descriptive statistics, not as decision-relevant inputs. Fewer than 1,500 people have individually forecast on the flagship AGI question.

The AI benchmarking pivot may be the real revealed theory of change. The most energy-intensive recent activity is the AI Forecasting Benchmark Series -- pitting AI bots against human forecasters. This suggests Metaculus's actual contribution may be as a benchmark platform for AI capabilities, not as a source of decision-relevant human forecasts.

The commercial arm is underdeveloped relative to the ambition. Pro Forecaster services exist but revenue is undisclosed. The Bridgewater partnership appears to be more about talent discovery than purchased forecasts. The for-profit PBC structure implies commercial aspirations, but the organization appears heavily grant-dependent.

Key Assumptions

Assumption 1: Aggregated crowd predictions are more accurate than alternatives.

  • Evidence for: Brier scores of 0.107-0.126. Outperformed prediction markets on 2022 midterms. COVID and Ukraine forecasts were genuinely ahead of conventional wisdom.
  • Evidence against: Overconfidence on AI numeric questions. The 2022 midterm advantage is "basically just one data point." AI bots are rapidly approaching and may surpass community accuracy.
  • Testable: Yes, through continued resolution of questions and cross-platform comparisons.
  • If wrong: The entire platform loses its accuracy advantage, which is its primary selling point.
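For context on the Brier scores cited above: lower is better, 0 is a perfect score, and always answering 50% scores 0.25. The sketch below assumes the common binary form of the Brier score (mean squared difference between forecast and outcome); the sample forecasts are hypothetical and only show what a score in the reported range can look like.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Uninformed baseline: always forecast 50%.
baseline = brier_score([0.5] * 4, [1, 0, 1, 0])

# Hypothetical calibrated forecaster: says 85% on 20 questions,
# and 17 of them (85%) resolve yes.
calibrated = brier_score([0.85] * 20, [1] * 17 + [0] * 3)

print(f"always-50% baseline: {baseline:.3f}")
print(f"85% confident, right 85% of the time: {calibrated:.4f}")
```

In this toy case the calibrated forecaster scores about 0.1275, close to the upper end of the range quoted above. That is only an illustration of scale, not a claim about how the Metaculus figures were produced.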

Assumption 2: Accurate forecasts lead to better decisions.

  • Evidence for: The Ukraine case (one user moved based on forecasts). CDC FluSight participation. Aguirre says some people in the AI community "take seriously" Metaculus AGI predictions.
  • Evidence against: Aguirre says the difference between 70% and 80% "doesn't really change what you do." Sempere finds most questions don't influence decisions. No documented institutional decision changed by Metaculus.
  • Testable: In principle yes, but Metaculus has never attempted to measure this.
  • If wrong: Metaculus is an expensive hobby for probabilistically-minded people with no real-world impact. The theory of change chain is broken at the critical link.

Assumption 3: The community is large and diverse enough for genuine "wisdom of crowds."

  • Evidence for: 2.9M predictions, some questions have hundreds of forecasters.
  • Evidence against: Fewer than 1,500 individual forecasters on the flagship AGI question. The community is heavily drawn from EA/rationalist circles. Normaltech argues this creates selection bias toward x-risk concerns.
  • Testable: Yes, through demographic surveys and comparison with broader populations.
  • If wrong: "Wisdom of crowds" becomes "consensus of a subculture" -- aggregating correlated biases rather than independent information.

Assumption 4: The scoring system incentivizes honest, information-adding forecasts.

  • Evidence for: It's a proper scoring rule (mathematically incentivizes truthful probability reporting). Users report it creates a collaborative culture.
  • Evidence against: Rheingans-Yoo proved the meta-game incentivizes no-information arbitrage. Sempere's arXiv paper confirms tournaments distort incentives. Zvi Mowshowitz quit immediately after seeing the system.
  • Testable: Yes, by analyzing forecast distributions for herding behavior and median-anchoring.
  • If wrong: The aggregation mechanism is contaminated by strategic rather than epistemic behavior.
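The "proper scoring rule" claim can be checked numerically for the plain logarithmic score: if a forecaster's true belief is q, the expected log score is highest when the reported probability equals q. The sketch below is a grid search under that assumption. It covers only the bare log score, not Metaculus's full points system, so it does not address the community-relative component that the Rheingans-Yoo critique targets. Function names are illustrative.

```python
import math


def expected_log_score(report: float, true_prob: float) -> float:
    """Expected log score when the event occurs with probability `true_prob`."""
    return true_prob * math.log(report) + (1 - true_prob) * math.log(1 - report)


def best_report(true_prob: float) -> float:
    """Report (on a 1%..99% grid) that maximizes the expected log score."""
    candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda p: expected_log_score(p, true_prob))


for belief in (0.25, 0.7, 0.9):
    print(f"true belief {belief:.2f} -> best report {best_report(belief):.2f}")
```

In each case the best report coincides with the true belief, which is what makes the rule proper. Any incentive to shade forecasts toward the median would have to come from other parts of the scoring or ranking system.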

Strengths

  1. Genuine accuracy advantage: On well-defined, resolvable questions in standard domains (elections, epidemiology, geopolitics), Metaculus community predictions demonstrably outperform naive baselines and are competitive with or better than prediction markets and expert panels.

  2. The AI benchmarking initiative is well-timed and well-designed: FutureEval creates a rigorous, publicly accessible benchmark for AI forecasting capabilities. This could become Metaculus's most lasting contribution -- a standardized way to measure how AI approaches human-level judgment on real-world questions.

  3. Open source and platform infrastructure: The BSD-2-Clause codebase provides genuine public goods. A third of contributors are external. This creates lasting infrastructure regardless of whether Metaculus itself succeeds commercially.

  4. Honest forecasting without financial stakes: Metaculus solves a real problem -- getting honest probability estimates without the leverage, manipulation, and adversarial dynamics of prediction markets. This is a meaningful innovation.

  5. Partnerships extending beyond the EA bubble: FAS, IST, CDC, Bridgewater, and EL PAIS are mainstream institutions. The policy-relevant tournament model (climate tipping points, AI+nuclear) shows promise for connecting forecasts to actual decisions.

Weaknesses and Risks

  1. Extreme funder concentration: Roughly 89% of known funding comes from a single source (Open Philanthropy), going by the grant amounts listed in the brief. This is the most obvious existential risk. If OP shifts priorities, Metaculus has demonstrated no viable alternative funding pathway.

  2. The founder doesn't believe in the core product: Aguirre's candid assessment that the exact probability "barely ever matters" is deeply concerning. When the founder of an organization publicly says its output is of marginal value, that should be taken seriously. His claim that the process of question definition is the real value has never been translated into a funded program.

  3. No demonstrated decision impact at scale: After 10+ years and $12M+ in funding, there is no documented case of an institution changing a policy based on Metaculus. Individual anecdotes (one Ukrainian user) and partnership announcements (FAS, IST) don't constitute evidence of decision-relevant impact. The theory of change chain breaks at "decision-makers use predictions."

  4. AI disruption is existential: Metaculus's own FutureEval projects AI will surpass community forecasters by April 2026 and pro forecasters by mid-2027. If AI can forecast as well as or better than humans at negligible marginal cost, the human community -- Metaculus's core asset -- becomes less central. The pivot to "benchmark platform" is necessary but represents a fundamental identity shift.

  5. Financial opacity for a philanthropically-funded PBC: Receiving $12M+ in grant funding with zero public financial reporting is a governance failure. The PBC structure means value created could accrue to equity holders rather than the mission, and there's no way to verify this from public information.

  6. Scoring system problems are fundamental, not cosmetic: The no-information arbitrage identified by Rheingans-Yoo and the tournament incentive distortion identified by Sempere strike at the core mechanism. These aren't edge cases -- they affect every forecaster's strategic behavior and therefore the quality of every aggregate prediction.

Cross-References

  • Relationship to Open Philanthropy/Coefficient Giving: Metaculus is essentially an Open Phil portfolio company, with roughly 89% of its known funding coming from Open Phil. This makes OP the de facto principal; Metaculus is the agent.
  • Relationship to Future of Life Institute: Deep structural connection (shared founder). FLI partnership on AI Pathways report. But Aguirre says the cross-pollination has been "surprisingly little."
  • Relationship to Good Judgment Inc: Collaboration not competition. Parallel forecasting on shared questions. GJI represents the "superforecaster" approach (small elite teams) vs. Metaculus's crowd approach.
  • Complementary to RAND's Forecasting Initiative: RAND's Vassalo explicitly sees AI forecasters as enabling his team to track hundreds of questions and deploy humans selectively. Metaculus is the platform that makes this possible.
  • Competitive with Manifold, Polymarket: These are prediction markets (money on the line). Metaculus's reputation-based approach is different but occupies overlapping epistemic territory.
  • Overlapping with QURI (Quantified Uncertainty Research Institute): QURI (founded by Ozzie Gooen, who commented on Sempere's analysis) focuses on forecasting infrastructure and epistemic tools. Adjacent but not directly competitive.

What Would Change This Assessment

  • Documented institutional decision impact: If a government agency, major foundation, or corporation published evidence that a Metaculus forecast changed a consequential resource allocation or policy decision, that would significantly strengthen the theory of change.
  • Sustainable commercial revenue: If Metaculus demonstrated that >50% of revenue comes from paying clients (not grants), it would validate the product-market fit and reduce the funder-dependency concern.
  • Scoring system reform with empirical validation: If Metaculus publicly addressed the Rheingans-Yoo critique and demonstrated reduced gaming behavior through reformed incentives, the core mechanism concern would be substantially mitigated.
  • Mini-Metaculus operational success: If the "mini-Metaculus" vision produced even 2-3 functioning instances where focused forecasting demonstrably influenced a decision (e.g., a city budget allocation), Turan's expanded theory of change would gain credibility.
  • AI surpassing humans on FutureEval: This would force a fundamental reassessment. If AI forecasters are cheaper and more accurate, Metaculus would need to pivot entirely from "human community platform" to "AI forecasting infrastructure." This changes the entire value proposition.

Self-Critique

What sources should I have checked but didn't?

  • The "State of Metaculus" retrospective posts on LessWrong/EA Forum were inaccessible. These likely contain the most self-reflective assessment of progress and challenges.
  • Metaculus.com pages (about, team, FAQ, services) were inaccessible to scrapers due to JavaScript rendering. These might contain more specific claims about impact and approach.
  • Internal Metaculus community discussions about scoring reforms.

Where is this analysis potentially biased?

  • I may be overweighting the Rheingans-Yoo scoring critique because it's mathematically precise and satisfying. In practice, most forecasters may not be gaming the system this way.
  • I may be underweighting the intangible value of community building and epistemic norm-setting, which is hard to measure but could be Metaculus's most important contribution.
  • The analysis reflects a bias toward requiring demonstrated decision impact. It's possible that improving collective epistemics has diffuse value that doesn't manifest as identifiable "decision changed" events.

What would a thoughtful person who disagrees say? "You're judging Metaculus by an impossibly high standard. Decision-makers don't cite specific inputs that changed their mind -- they absorb information from many sources. Metaculus AGI timeline data has genuinely shifted the Overton window on how seriously people take near-term AI risk. The COVID dashboards and Ukraine predictions saved lives. The platform has trained thousands of people to think probabilistically, which has compounding returns. You can't measure these diffuse effects, but they're real. And comparing $225K/year in measured value to $12M+ in funding is unfair because Sempere only measured direct question-level impact, not the platform's role in building forecasting infrastructure and culture."

What's my single weakest claim? The claim that Metaculus has "no demonstrated decision impact at scale." I'm using an absence of documented evidence as evidence of absence. It's entirely possible that CDC staff, FAS policy analysts, or Open Phil grantmakers regularly consult Metaculus forecasts and it genuinely influences their decisions -- they just haven't published about it. The decision-relevance gap may be a documentation gap rather than an impact gap.

What information would most change my view? A candid interview with a decision-maker at a Metaculus partner organization (CDC, FAS, IST, Open Phil) who says: "We were going to do X, consulted the Metaculus forecast, and changed to Y because the forecast showed Z." One concrete, verified instance of this would substantially update my assessment from "theory of change is broken at the decision link" to "theory of change works but is undersold."

Connected to (7)

Future of Life Institute -- board overlap · Anthony Aguirre
Bridgewater Associates -- collaborator
AI Objectives Institute -- staff from · Deger Turan
Effective Institutions Project -- board overlap · Gaia Dempsey
Federation of American Scientists -- collaborator
Institute for Security and Technology -- collaborator
Good Judgment Inc -- collaborator

Sources (33)
Every URL that was read during research.
  1. Metaculus - Wikipedia (en.wikipedia.org)
  2. Anthony Aguirre - Wikipedia (en.wikipedia.org)
  3. An examination of Metaculus’ resolved AI predictions and their implications for AI timelines (rethinkpriorities.org)
  4. Entropic Thoughts (entropicthoughts.com)
  5. 38.7 - Anthony Aguirre on the Future of Life Institute (axrp.net)
  6. 214: Predicting the future of science and tech, with Metaculus (Anthony Aguirre) (rationallyspeakingpodcast.org)
  7. Metaculus (Anthony Aguirre Interview) (nonprophetspod.wordpress.com)
  8. Metaculus Launches FutureEval to Track AI Forecasting Accuracy (globenewswire.com)
  9. Metaculus — Grokipedia (grokipedia.com)
  10. Good Judgment Inc and Metaculus Launch First Collaboration - Good Judgment (goodjudgment.com)
  11. IST to Partner with Metaculus on Forecasting the AI and Nuclear Landscape (securityandtechnology.org)
  12. What can we learn from scoring different election forecasts? (firstsigma.substack.com)
  13. Vehbi Deger Turan - Federation of American Scientists (fas.org)
  14. AI Is Learning to Predict the Future—And Beating Humans at It (time.com)
  15. Compare Best Prediction Markets 2025 (thebestpredictionmarkets.com)
  16. Forecasting - Metaculus Q4 2024 Retrospective (obrhubr.org)
  17. Anthony Aguirre (anthony-aguirre.com)
  18. GitHub - Metaculus/metaculus (github.com)
  19. Anthony Aguirre - Future of Life Institute (futureoflife.org)
  20. How well did experts and laypeople forecast the size of the COVID-19 pandemic? (journals.plos.org)
  21. Evaluating Predictions of Model Behaviour | GovAI (governance.ai)
  22. Scaling Superforecasting: AI Forecasting Tournaments & Road to Epistemic Security, with Deger Turan (cognitiverevolution.ai)
  23. Bridgewater x Metaculus 2026 Competition (bridgewater.com)
  24. Forecasting and related research and implementation | Career review | 80,000 Hours (80000hours.org)
  25. Superhuman Automated Forecasting | CAIS (safe.ai)
  26. FAS and Metaculus are Using Forecasting to Support Better Climate Policy - Federation of American Scientists (fas.org)
  27. Alignment Problems With Current Forecasting Platforms (arxiv.org)
  28. FutureSearch (futuresearch.ai)
  29. Metaculus has some issues (blog.rossry.net)
  30. Measure is unceasing (nunosempere.com)
  31. Measure is unceasing (nunosempere.com)
  32. AI existential risk probabilities are too unreliable to inform policy (normaltech.ai)
  33. Effective Altruism Funds: Effective Altruism Infrastructure Fund donations made to Metaculus (donations.vipulnaik.com)