Related Work

Status: early draft / working bibliography. Evaluation engineering is not a standalone idea — it’s the accumulated agenda of the Quantified Uncertainty Research Institute (QURI) and collaborators, restated. This page maps the most relevant published work to the parts of the field it bears on. See the EA Forum QURI topic for the broader list, and Adjacent Fields & Literature for the external academic literatures (forecasting, decision analysis, program evaluation, LLM evals, scalable oversight, estimation/ontology tooling).

Two things in this corpus are scarce and worth foregrounding: empirical results (small but real experiments, listed below) and deployed implementations (running systems with published usage data and honest failure admissions). They are what distinguish a considered agenda from one more framework post.

  • Prediction-Augmented Evaluation Systems (Ozzie Gooen, LessWrong, 2018). The original “predict the evaluation” idea — the direct ancestor of prediction–evaluation systems. The wiki’s whole estimation/evaluation-bridging move is implicit here.
  • (Highly Optimized) Evaluations Are All You Need and the earlier Advanced / Symbolic Evaluation Systems drafts. The cause-area statement this wiki is built from. See Lineage.

Empirical results (the scarce, valuable part)

These are quotable, dated experiments — exactly the “concrete case studies” the field is short on (see Objections & FAQ).

  • Amplifying generalist research via forecasting, Part 1 (models/challenges) and Part 2 (results) (Gooen, Sempere, et al., 2019). The flagship test of prediction–evaluation: crowd forecasters predicting a trusted evaluator's judgments recovered a large share (reported ~73%) of the evaluator’s benefit-cost signal at far lower cost. One of very few real experiments in this space.
  • An experiment to evaluate the value of one researcher’s work (EA Forum, 2019). Elicitation of value estimates over research outputs.
  • Predicting the value of small altruistic projects (Nuño Sempere, 2020). Proof-of-concept that forecasters can discriminate project value pre-execution — with a documented failure mode: systematic optimism.
  • Relative-value elicitation experiments (Open Phil AI-safety grants, 2022; valuing research works, 2022). Real data on inter-rater disagreement and how it aggregates.
  • Squiggle (squiggle-language.com; GitHub). A small language for probabilistic estimation — the working instance of estimation functions.
  • Squiggle AI (2025). An LLM (Claude) front-end that generates Squiggle models — a deployed estimation system, with published early usage data and a frank writeup of systematic overconfidence in generated estimates.
  • Scorable Functions (2024). The estimator-as-program object, later partially retracted: the author flagged that LLM-on-demand estimates may dominate pre-built functions. Still a useful source of lessons learned.
  • Guesstimate (2016). The early spreadsheet-style tool that motivated much of this; see Use Cases.
  • Metaforecast (metaforecast.org). Aggregates and searches forecasts across platforms — infrastructure for the ontology layer.
  • Foretold.io (EA Forum, 2019). An open-source prediction registry; early structured-forecasting plumbing.
  • Relative Value Functions: A Flexible New Format for Value Estimation (EA Forum, 2023), plus the Utility Function Extractor and comparison-polling tools. The closest thing to a methods stack behind the evaluation-methods page’s open “elicitation” questions.
  • RoastMyPost (2025). A deployed LLM-plus-code tool that evaluates posts and research documents for errors, fallacies, and inaccuracies — a running evaluation system with multiple evaluator types.
  • Shallow evaluations of longtermist organizations (Sempere, 2021). A real, scaled-down charity-evaluation effort; the kind of “shallow but useful” output skeptics have found valuable.
  • Quantifying Uncertainty in GiveWell’s GiveDirectly Cost-Effectiveness Analysis (Sam Nolan, 2021). Putting distributions on a real CEA — estimation in the charity-evaluation domain.
  • Incentive Problems / Alignment Problems with Current Forecasting Platforms (Sempere & Lawsen, 2020–21). The concrete catalogue of reward-specification failures — directly relevant to whether prediction–evaluation incentives survive gaming.
  • Prediction Markets in the Corporate Setting (Sempere & Yagudin, 2021). An honest negative result on why organizations reject internal markets (tooling, question-writing cost, social disruption) — feeds Epistemic Culture and Objections.
  • Opinion Fuzzing (2025). Evidence that LLM judgments shift substantially with prompt phrasing alone, and shift further across models and personas. A caution for evaluation reliability.
  • Accuracy Agreements (2023). Pay-per-bit scoring contracts — a trust-network-adjacent incentive design.
  • Can We Place Trust in Post-AGI Forecasting Evaluations? (2019) → AI for Resolving Forecasting Questions / Epistemic Selection Protocols (2025). The deferred-resolution thread: how to ground evaluations when the resolver is itself an AI. Overlaps heavily with the sibling RRP wiki.
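For readers new to the estimation entries above (Squiggle, Guesstimate, the GiveDirectly uncertainty analysis), here is a minimal Python sketch of what "putting distributions on a cost-effectiveness analysis" can look like. It is an illustration only, not code from any of the listed projects. Every number, function name, and unit is hypothetical. Treating a "low to high" range as a lognormal 90% interval is an assumption that loosely mirrors how Squiggle's range shorthand is usually described.

```python
import math
import random

# z-score that bounds the central 90% interval of a standard normal
Z90 = 1.6448536269514722


def lognormal_from_90ci(low, high, rng):
    """Draw one sample from a lognormal whose 5th/95th percentiles are low/high."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return rng.lognormvariate(mu, sigma)


def simulate_cost_effectiveness(n=20_000, seed=0):
    """Monte Carlo samples of benefit / cost, returned in sorted order.

    The input ranges are made-up placeholders, not figures from any real analysis.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        benefit = lognormal_from_90ci(50.0, 500.0, rng)  # hypothetical benefit units
        cost = lognormal_from_90ci(10.0, 40.0, rng)      # hypothetical cost units
        samples.append(benefit / cost)
    samples.sort()
    return samples


def percentile(sorted_samples, q):
    """Nearest-rank percentile of an already sorted list, with q in [0, 1]."""
    idx = min(len(sorted_samples) - 1, int(q * len(sorted_samples)))
    return sorted_samples[idx]


if __name__ == "__main__":
    s = simulate_cost_effectiveness()
    for q in (0.05, 0.5, 0.95):
        print(f"p{int(q * 100):02d}: {percentile(s, q):.2f}")
```

The output is a distribution over the benefit-to-cost ratio rather than a single point estimate, which is the property the estimation tools listed above are built to expose.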

A note on sourcing. Specific figures above (e.g. the ~73% amplification result) are quoted from QURI’s published posts and the wiki’s internal corpus survey; check them against the linked originals before relying on them. This list is not exhaustive — additions welcome.