Adjacent Fields & Literature

Status: early draft / curated bibliography, assembled from a June 2026 literature sweep. Evaluation engineering is a synthesis, not a clean-sheet invention: most of its hard parts have been studied for decades under other names. This page maps the nine adjacent literatures, what each offers, and where each stops short of the systems/throughput question. For QURI’s own prior work, see Related Work.

1. Judgmental forecasting & prediction markets

The most direct empirical precedent for treating estimation as a measurable, optimizable pipeline. The IARPA tournaments and the Good Judgment Project showed that producing thousands of scored probabilistic estimates turns “forecasting” into an engineering problem whose components (talent selection, training, teaming, aggregation) each yield quantifiable accuracy gains. This literature hands evaluation engineering three reusable primitives: proper scoring rules (metrics that make estimate quality measurable and incentive-compatible), aggregation theory (turning many cheap judgments into one accurate one), and two contrasting production architectures (polls vs. markets) with documented cost/accuracy/robustness trade-offs. Its gap: it optimizes accuracy and calibration but rarely treats cost or throughput as first-class, and it says little about machine evaluators. Relates to Prediction and prediction–evaluation systems.

  • Identifying and Cultivating Superforecasters — Mellers, Tetlock, et al. (2015), Perspectives on Psychological Science. PDF — Talent-selection + teaming produce durable accuracy gains; the case for engineering a pipeline around component interventions.
  • Strictly Proper Scoring Rules, Prediction, and Estimation — Gneiting & Raftery (2007), JASA. PDF — The definitive treatment of Brier/log scoring; the formal foundation for any comparable, incentive-compatible estimate metric.
  • Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls — Atanasov, Tetlock, et al. (2017), Management Science. link — A randomized comparison of two production architectures; the core accuracy-vs-architecture trade-off.
  • Prediction Markets — Wolfers & Zitzewitz (2004), Journal of Economic Perspectives. PDF — Foundational survey of markets as a low-cost information-aggregation engine.
  • Corporate Prediction Markets: Evidence from Google, Ford, and Firm X — Cowgill & Zitzewitz (2015), Review of Economic Studies. link — Internal markets are well-calibrated yet biased and participation-dependent; the key evidence on why real evaluation systems fail in practice.
  • Shall We Vote on Values, But Bet on Beliefs? — Hanson (2013), Journal of Political Philosophy. PDF — Futarchy: using market-produced estimates to drive decisions. Direct ancestor of the decision-support framing; see Use Cases.
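
The first two primitives above can be made concrete with a minimal sketch. The forecaster names, probabilities, and outcomes below are hypothetical, and the unweighted mean is only the simplest aggregation rule, not the one used in the tournaments. The sketch shows the Brier and log scores, a grid check that an honest report minimizes expected Brier loss (properness), and a comparison of the pooled forecast against the average individual score.

```python
import math

def brier_score(p: float, outcome: int) -> float:
    """Squared error between a forecast probability and a 0/1 outcome (lower is better)."""
    return (p - outcome) ** 2

def log_score(p: float, outcome: int, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the realized outcome (lower is better)."""
    p = min(max(p, eps), 1.0 - eps)  # clip so a 0 or 1 forecast does not produce log(0)
    return -math.log(p if outcome == 1 else 1.0 - p)

def expected_brier(report: float, belief: float) -> float:
    """Expected Brier score of reporting `report` when the event has probability `belief`."""
    return belief * brier_score(report, 1) + (1.0 - belief) * brier_score(report, 0)

def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)

# Hypothetical data: three forecasters, four resolved binary questions.
forecasts = {
    "a": [0.9, 0.2, 0.6, 0.3],
    "b": [0.7, 0.4, 0.8, 0.1],
    "c": [0.6, 0.1, 0.4, 0.5],
}
outcomes = [1, 0, 1, 0]

# Mean Brier score per forecaster.
individual = {
    name: mean(brier_score(p, o) for p, o in zip(ps, outcomes))
    for name, ps in forecasts.items()
}

# Simplest aggregation rule: unweighted mean of the probabilities on each question.
pooled = [mean(ps[i] for ps in forecasts.values()) for i in range(len(outcomes))]
pooled_brier = mean(brier_score(p, o) for p, o in zip(pooled, outcomes))
pooled_log = mean(log_score(p, o) for p, o in zip(pooled, outcomes))

# Properness check: on a grid of possible reports, the honest report minimizes expected loss.
grid = [i / 100 for i in range(101)]
best_report = min(grid, key=lambda r: expected_brier(r, 0.7))

print("individual Brier:", individual)
print("pooled Brier:", round(pooled_brier, 4), "pooled log score:", round(pooled_log, 4))
print("report minimizing expected Brier at belief 0.7:", best_report)
```

Because the Brier score is convex in the forecast, the mean forecast never scores worse than the average of the individual scores, although it can still lose to the best individual forecaster, as it does on this toy data.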

2. Decision analysis, value of information & expert elicitation

This is where evaluation engineering gets its triage rule. Decision analysis (Howard; Raiffa & Schlaifer) supplies the machinery — subjective probability, utility, and an explicit cycle in which uncertainty is quantified and its resolution priced. Value of information (EVPI/EVPPI/EVSI) operationalizes the most important question in the field: how much is a given estimate worth before you pay to produce it? Structured expert elicitation (Cooke’s Classical Method, SHELF, IDEA, Delphi) provides validated protocols for getting estimates out of people with measured accuracy and informativeness. Relates to Evaluation as a System (the accuracy × quantity × cost frontier) and the open elicitation questions on Evaluation Methods.

  • Information Value Theory — Howard (1966), IEEE Trans. SSC. (DOI 10.1109/TSSC.1966.300074) — The founding VOI paper: information’s value can only be assessed jointly with the decision it informs. The root of “which estimate is worth producing.”
  • Applied Statistical Decision Theory — Raiffa & Schlaifer (1961), Harvard. Wiley reissue — The Bayesian decision-theory foundation underpinning EVSI: the value of a partial evaluation.
  • A Review of Methods for the Analysis of the Expected Value of Information — Heath, Manolopoulou & Baio (2015), arXiv. link — How modern non-parametric methods made VOI tractable at scale — directly “many estimates at known cost.”
  • Experts in Uncertainty — Cooke (1991), Oxford. Internet Archive — Introduces the Classical Method: weight experts by measured accuracy on seed questions. The quality-control backbone of elicited evaluation.
  • A practical guide to structured expert elicitation using the IDEA protocol — Hemming et al. (2018), Methods in Ecology and Evolution. link — A deployable Investigate-Discuss-Estimate-Aggregate workflow for group elicitation.
  • Multiple Criteria Decision Analysis: An Integrated Approach — Belton & Stewart (2002), Springer. link — The standard reference for evaluating options against multiple non-commensurable criteria.
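
The value-of-information question ("how much is an estimate worth before you pay for it?") can be illustrated with a small expected value of perfect information (EVPI) calculation. This is a sketch under stated assumptions: the two-action, two-state decision, its payoffs, and the prior are all hypothetical, and real applications usually need EVPPI or EVSI rather than the perfect-information upper bound.

```python
# Hypothetical decision: fund or skip a program whose payoff depends on whether it works.
prior = {"works": 0.3, "fails": 0.7}
payoff = {
    "fund": {"works": 100.0, "fails": -40.0},
    "skip": {"works": 0.0, "fails": 0.0},
}

def expected_value(action: str, beliefs: dict) -> float:
    """Expected payoff of one action under the given beliefs over states."""
    return sum(beliefs[state] * payoff[action][state] for state in beliefs)

def value_without_information(beliefs: dict) -> float:
    """Commit to the single best action using current beliefs only."""
    return max(expected_value(action, beliefs) for action in payoff)

def value_with_perfect_information(beliefs: dict) -> float:
    """Learn the state first, then pick the best action for that state."""
    return sum(
        beliefs[state] * max(payoff[action][state] for action in payoff)
        for state in beliefs
    )

def evpi(beliefs: dict) -> float:
    """Upper bound on what any estimate of the state is worth for this decision."""
    return value_with_perfect_information(beliefs) - value_without_information(beliefs)

print("EVPI under the prior:", evpi(prior))
print("EVPI when already certain:", evpi({"works": 1.0, "fails": 0.0}))
```

Under these numbers the best action without information is worth 2 and with perfect information 30, so no estimate of the state is worth more than 28 to this decision. Once the decision-maker is already certain, the same estimate is worth nothing, which is the triage rule in miniature.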

Cost-effectiveness analysis (QALYs/DALYs plus willingness-to-pay thresholds) is the largest real-world standardized, scaled evaluation system — a common unit and a calibrated price-per-unit letting thousands of interventions be compared. It is the operational descendant of Bentham’s felicific calculus (see Use Cases).

3. Program evaluation & impact measurement

There is already a mature, professionalized field called Evaluation — fifty years of work on exactly the problem of credibly judging the value of interventions. Its lessons transfer directly and should not be reinvented: defining the evaluand and the valuing step is the hard part (Scriven); an estimate nobody acts on is wasted throughput (Patton’s utilization-focus); “what works” is incomplete without “for whom, in what context, via what mechanism” (Pawson & Tilley; Deaton’s critique of the RCT movement); and aggregating heterogeneous outcomes into one score is a known minefield of normalization and weighting politics (the OECD composite-indicators handbook). Where it differs from evaluation engineering: it is overwhelmingly bespoke, slow, and study-centric — rich on validity, use, and the contestation of value, but nearly silent on standardizing and scaling the production process (throughput, cost-per-judgment, pipeline reuse). That production layer is what’s left to build. Relates to the Estimation vs. Evaluation note that this field exists but is scoped differently.

  • Evaluation: A Systematic Approach — Rossi, Lipsey & Freeman (7th ed., 2004), SAGE. — The standard textbook; its needs → theory → process → outcome → efficiency division is a ready-made taxonomy of estimate types.
  • Utilization-Focused Evaluation — Patton (5th ed., 2021), SAGE. — Evaluations should be judged by their actual use by intended users. The core lesson for an estimate factory: volume is worthless without a decision attached to each output.
  • Goal-free evaluation / Key Evaluation Checklist — Scriven. KEC PDF — Measure actual effects without being cued by stated goals; a direct warning that goal/framing-anchoring is an attack surface on mass-produced estimates.
  • Realistic Evaluation — Pawson & Tilley (1997), SAGE. — The Context-Mechanism-Outcome framework; a caution against decontextualized point-estimates of effect.
  • Instruments of Development: Randomization in the Tropics — Deaton (2009). PDF — The canonical critique of the “randomista” movement; a counterweight to equating volume-of-rigorous-estimates with knowledge. (See also J-PAL / Banerjee & Duflo’s Poor Economics as the closest existing “evaluation-as-scaled-institution” analog.)
  • Handbook on Constructing Composite Indicators — OECD/JRC (2008). PDF — The authoritative methodology for rolling many indicators into one score — and a warning that weighting choices drive results. (The UNDP’s HDI is the long-running worked example.)
  • GiveWell cost-effectiveness analyses & moral weights — GiveWell, ongoing. link — The most evaluation-engineering-like artifact in philanthropy: a standardized, transparent, reusable pipeline whose admitted weak point is the value-laden weighting step.
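
The warning that weighting choices drive composite scores is easy to reproduce. The sketch below uses hypothetical programs, indicators, and weights, with min-max normalization and a linear weighted sum as one common construction (the OECD handbook covers several others). It assumes each indicator takes at least two distinct values, since min-max scaling is undefined otherwise.

```python
# Hypothetical raw indicator values on incommensurable scales.
raw = {
    "program_a": {"health": 0.9, "income": 100.0},
    "program_b": {"health": 0.5, "income": 900.0},
    "program_c": {"health": 0.7, "income": 500.0},
}

def normalize(table: dict) -> dict:
    """Min-max scale each indicator to [0, 1] across units."""
    indicators = next(iter(table.values())).keys()
    scaled = {unit: {} for unit in table}
    for indicator in indicators:
        values = [table[unit][indicator] for unit in table]
        lo, hi = min(values), max(values)
        for unit in table:
            scaled[unit][indicator] = (table[unit][indicator] - lo) / (hi - lo)
    return scaled

def ranking(table: dict, weights: dict) -> list:
    """Rank units by the weighted sum of their normalized indicators, best first."""
    scaled = normalize(table)
    scores = {
        unit: sum(weights[ind] * scaled[unit][ind] for ind in weights)
        for unit in table
    }
    return sorted(scores, key=scores.get, reverse=True)

health_first = ranking(raw, {"health": 0.7, "income": 0.3})
income_first = ranking(raw, {"health": 0.3, "income": 0.7})
print(health_first)
print(income_first)
```

The underlying data are identical in both calls; only the weights change, and the top and bottom programs swap places. Any pipeline that emits a single composite score should therefore publish the weights and ideally a sensitivity analysis alongside it.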

The empirical foundation for the AI-cheap-evaluation wing of the field. The central finding: strong LLM judges can approximate expensive human judgment at roughly human-level agreement while collapsing cost-per-evaluation by orders of magnitude — exactly the accuracy × quantity × cost move. The field has matured from “can LLMs judge?” into a science of failure modes: reproducible judge biases (position, verbosity, self-preference), benchmark contamination, and metric artifacts. A parallel reward-modeling thread shows the systems lesson from the optimization side — any cheap proxy evaluator degrades via Goodhart once optimized against. It supplies both the tooling and the hazard catalog any evaluation system must engineer around; see also Objections & FAQ on evaluation reliability.

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng, Chiang, et al. (2023), arXiv. link — The landmark: GPT-4-as-judge reaches over 80% agreement with human preferences (on par with human–human agreement), and names the position/verbosity/self-enhancement biases.
  • A Survey on LLM-as-a-Judge — Gu, Jiang, et al. (2024), arXiv. link — Frames “how to build a reliable LLM judge”; a good backbone for this section.
  • Length-Controlled AlpacaEval — Dubois, Galambosi, Liang & Hashimoto (2024), arXiv. link — A causal-inference debiasing estimator for verbosity gaming; exemplar of engineering a cheap evaluator to resist manipulation.
  • Holistic Evaluation of Language Models (HELM) — Liang, Bommasani, et al. (2022), arXiv. link — Evaluation as standardized, multi-metric, living infrastructure across many scenarios.
  • Are Emergent Abilities of LLMs a Mirage? — Schaeffer, Miranda & Koyejo (2023), NeurIPS (Outstanding Paper). link — Many “emergent” jumps are artifacts of the metric, not the model: the foundational construct-validity caution.
  • LLM Critics Help Catch LLM Bugs (CriticGPT) — McAleese et al. / OpenAI (2024), arXiv. link — Critic models help human evaluators; assisted humans beat unassisted ~60% of the time. The key human-AI teaming result.
  • Scaling Laws for Reward Model Overoptimization — Gao, Schulman & Hilton (2022), arXiv. link — Optimizing against a proxy reward degrades true reward predictably (Goodhart): a cheap evaluator breaks under optimization pressure.
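
One of the judge biases above, position bias, can be probed with a simple order-swap test: ask the judge to compare the same pair twice with the answers in opposite slots and count how often the verdict names the same answer. The sketch below is illustrative only. `toy_biased_judge` is a hypothetical stand-in rather than a real model call, and the `(prompt, first, second) -> "first" | "second"` interface is an assumption, not any library's API.

```python
from typing import Callable, List, Tuple

# Assumed judge interface: returns "first" or "second" for the preferred slot.
Judge = Callable[[str, str, str], str]

def position_consistency(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the judge picks the same answer in both presentation orders."""
    consistent = 0
    for prompt, a, b in items:
        verdict_ab = judge(prompt, a, b)  # a shown in the first slot
        verdict_ba = judge(prompt, b, a)  # b shown in the first slot
        winner_ab = a if verdict_ab == "first" else b
        winner_ba = b if verdict_ba == "first" else a
        consistent += winner_ab == winner_ba
    return consistent / len(items)

def toy_biased_judge(prompt: str, first: str, second: str) -> str:
    """Hypothetical judge: prefers the longer answer, but defaults to slot one on close calls."""
    if abs(len(first) - len(second)) <= 5:
        return "first"
    return "first" if len(first) > len(second) else "second"

def length_only_judge(prompt: str, first: str, second: str) -> str:
    """Hypothetical judge with pure verbosity bias and no position bias."""
    return "first" if len(first) > len(second) else "second"

# Hypothetical comparison items: (prompt, answer_a, answer_b).
items = [
    ("q1", "short", "a considerably longer answer"),
    ("q2", "yes it is", "no it isn't"),
    ("q3", "42", "the answer is forty-two"),
    ("q4", "maybe", "perhaps"),
]

print(position_consistency(toy_biased_judge, items))
print(position_consistency(length_only_judge, items))
```

The second judge is perfectly order-consistent while still rewarding length alone, so a swap test bounds position bias but says nothing about verbosity or self-preference. Each failure mode in the hazard catalog needs its own probe.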

5. Scalable oversight & AI-assisted reasoning

How to evaluate outputs too hard or costly to check directly — i.e. ground or amplify a trusted-but-expensive evaluator using cheaper or AI-assisted processes. Two protocol families dominate: decomposition (iterated amplification / factored cognition; recursive reward modeling), which breaks an expensive evaluation into cheaper sub-evaluations, and adversarial/game-theoretic (debate; prover-verifier; market-making), which pits optimizers against each other so a weak judge can extract signal from their conflict. Sandwiching gives a way to measure whether a protocol actually closes the weak-evaluator-to-expert gap. Empirically: debate beats single-advisor baselines and helps non-experts reach expert accuracy; process supervision beats outcome supervision. Caveats: results are mostly on QA/math with artificial weak/strong gaps, and persuasiveness can be optimized independently of truth. This is the literature the sibling RRP wiki centers; here it bears on prediction–evaluation and resolution.

  • AI safety via debate — Irving, Christiano & Amodei (2018), arXiv. link — Two agents debate, a human judges; the blueprint for using adversarial structure to extend a limited evaluator’s reach.
  • Supervising strong learners by amplifying weak experts — Christiano, Shlegeris & Amodei (2018), arXiv. link — Iterated Amplification: build a training signal for hard problems by recursive decomposition into easy ones.
  • Scalable agent alignment via reward modeling — Leike et al. (2018), arXiv. link — Recursive reward modeling: use trained agents to help evaluate the next, harder task.
  • Debating with More Persuasive LLMs Leads to More Truthful Answers — Khan, Hughes, et al. (2024), ICML. link — Empirical evidence that debate substantially raises the accuracy of weaker (non-expert) judges, both model and human, over no-debate baselines.
  • Let’s Verify Step by Step — Lightman et al. / OpenAI (2023), arXiv. link — Process supervision (grading each step) beats outcome supervision and yields stronger verifiers. Bears on what to evaluate.
  • Measuring Progress on Scalable Oversight for Large Language Models — Bowman et al. / Anthropic (2022), arXiv. link — Operationalizes “sandwiching” as an experimental paradigm — the most directly relevant way to benchmark an evaluation system’s accuracy-vs-cost.
  • Weak-to-Strong Generalization — Burns et al. / OpenAI (2023), arXiv. link — Strong models trained on weak labels can exceed their supervisors: bears on whether a cheap/weak evaluator’s signal can elicit what it can’t itself verify.

6. Estimation tooling, probabilistic programming & ontologies

The computational substrate for the calculation and ontology components. Probabilistic programming (Stan, PyMC, Pyro, Church) lets a modeler declare a model once and get inference and uncertainty propagation automatically — estimation as a repeatable, composable computation. A lighter branch (Guesstimate, Squiggle, plus calibrated human Fermi estimation) targets fast estimation over many variables, closer to the throughput regime this field cares about. Ontologies and large knowledge bases (Gruber; the Semantic Web; Wikidata, Freebase, YAGO, NELL) address how to define and populate at scale the set of entities to estimate over — with a recurring tension between hand-curated and automatically-constructed knowledge that mirrors the symbolic-vs-statistical debate. Relates to estimation functions and the “ontology as silent bottleneck” claim on The Four Components.

  • Stan: A Probabilistic Programming Language — Carpenter, Gelman, et al. (2017), J. Statistical Software. link — The canonical Bayesian inference engine; the modern “specify model once, get estimates + uncertainty” template.
  • Probabilistic Programming in Python using PyMC3 — Salvatier, Wiecki & Fonnesbeck (2016), PeerJ CS. link — Scriptable probabilistic modeling in Python; fits programmatic generation of many models.
  • Church: a Language for Generative Models — Goodman, Mansinghka, et al. (2008), UAI. link — The foundational universal PPL; “estimation as executable generative model.”
  • A Translation Approach to Portable Ontology Specifications — Gruber (1993), Knowledge Acquisition. PDF — The origin of “ontology = a specification of a conceptualization”; foundational for structuring what to estimate over.
  • The Semantic Web — Berners-Lee, Hendler & Lassila (2001), Scientific American. link — The manifesto for machine-readable, linkable, typed knowledge as infrastructure.
  • Wikidata: A Free Collaborative Knowledgebase — Vrandečić & Krötzsch (2014), CACM. link — The largest open structured KB; a concrete source of the entities/relations a system would range over.
  • YAGO: A Core of Semantic Knowledge — Suchanek, Kasneci & Weikum (2007), WWW. link — Landmark automated KB construction (extracting facts from Wikipedia); harvesting rather than authoring an ontology. (See also NELL, Carlson et al., 2010, for never-ending machine-driven population.)
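
The lighter estimation branch (Guesstimate- or Squiggle-style Fermi models) amounts to declaring uncertain inputs and pushing samples through a formula. The sketch below imitates that idea in plain Python with the standard library. It does not use Squiggle's or any PPL's actual syntax, and the model, intervals, and Beta parameters are hypothetical. Inputs are lognormals fitted to a stated 90% interval, and the output interval is read off the simulated samples.

```python
import math
import random

Z90 = 1.6448536269514722  # standard-normal quantile bounding a central 90% interval

def lognormal_from_90ci(low: float, high: float):
    """Return a sampler for a lognormal whose 5th and 95th percentiles are low and high."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return lambda rng: rng.lognormvariate(mu, sigma)

def simulate(model, n: int = 20000, seed: int = 0) -> list:
    """Draw n samples from a model (a function of a random.Random) and return them sorted."""
    rng = random.Random(seed)
    return sorted(model(rng) for _ in range(n))

def summarize(samples: list) -> dict:
    """Empirical 5th percentile, median, and 95th percentile of sorted samples."""
    n = len(samples)
    return {
        "p5": samples[int(0.05 * n)],
        "median": samples[n // 2],
        "p95": samples[int(0.95 * n)],
    }

# Hypothetical Fermi model: cost per success = total cost / (people reached * success rate).
total_cost = lognormal_from_90ci(50_000, 200_000)
people_reached = lognormal_from_90ci(1_000, 10_000)

def success_rate(rng: random.Random) -> float:
    return rng.betavariate(4, 6)  # Beta with mean 0.4

def cost_per_success(rng: random.Random) -> float:
    return total_cost(rng) / (people_reached(rng) * success_rate(rng))

input_check = summarize(simulate(total_cost))
result = summarize(simulate(cost_per_success))
print("input check:", input_check)
print("cost per success:", result)
```

The model is an ordinary function, so generating many such estimates programmatically is just a loop over models. A full probabilistic programming language adds conditioning on data, which this forward-sampling sketch does not attempt.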

Metrology is the most mature discipline organized entirely around the property this field wants: producing measurements at known, traceable, documented uncertainty and cost. Four ideas transfer almost directly. Uncertainty budgets — enumerate every error source, classify each as Type A (statistical) or Type B (judgment/prior-based), and combine into one defensible figure — map onto reasoning about an evaluation pipeline’s total error. Traceability — the unbroken documented chain of calibrations linking a result back to a reference — is the provenance-chain analog that makes estimates auditable. Calibration against reference standards and proficiency testing / interlaboratory comparison give a template for periodically checking that evaluators (human or model) still agree on shared reference items. Metrology also offers a clean import: the split between error (the unknowable true deviation) and uncertainty (the quantifiable, reportable dispersion). What does not transfer: metrology presumes a stable, SI-anchored physical measurand, whereas evaluation targets are often contested, one-off, and non-stationary — so judgment (“Type B”) dominates and traceability must terminate in argument/provenance rather than a physical constant. Relates to Evaluation as a System (known cost/uncertainty; the capability ladder) and consistency checks.

  • Guide to the Expression of Uncertainty in Measurement (GUM) — JCGM 100:2008, BIPM/JCGM. link — The foundational formalism for Type A/Type B uncertainty, combined/expanded uncertainty, and the uncertainty-budget method: the closest existing thing to “report every estimate with a defensible error bar.”
  • International Vocabulary of Metrology (VIM) — JCGM 200:2012, BIPM/JCGM. link — Authoritative definitions of measurand, traceability, calibration, error, uncertainty — a model glossary discipline to imitate.
  • Metrological Traceability (FAQ & NIST policy) — NIST. link — Traceability as a documented unbroken chain of calibrations, each contributing to uncertainty: the provenance-chain blueprint for auditable estimates.
  • ISO 13528:2022 — Statistical methods for proficiency testing by interlaboratory comparison — ISO. link — How to assign reference values, score participants, and detect outlier labs: directly analogous to benchmarking multiple evaluators/models against shared items to detect drift. (Paywalled standard.)
  • M3003: The Expression of Uncertainty and Confidence in Measurement — UKAS (Ed. 6, 2024). PDF — A working, applied companion to the GUM showing how labs actually build uncertainty budgets under accreditation.
  • Standards for Educational and Psychological Testing — AERA/APA/NCME (2014). overview — The “soft” analog: a mature framework (validity, reliability, fairness) for measuring constructs with no physical reference standard — closest to evaluation’s contested, non-stationary targets.
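
The uncertainty-budget idea can be sketched for an evaluation pipeline as follows. The combination rule (root sum of squares of sensitivity-weighted standard uncertainties, for uncorrelated inputs), the rectangular Type B conversion, and the coverage factor k = 2 follow the GUM. The component names and numbers are hypothetical, and treating judge scatter and rubric ambiguity as independent is an assumption made only for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    kind: str                 # "A" (statistical) or "B" (judgment / prior-based)
    std_uncertainty: float    # standard uncertainty in the units of the input
    sensitivity: float = 1.0  # how strongly the result moves per unit of this input

def type_a(readings: list) -> float:
    """Type A: standard uncertainty of the mean, from repeated observations."""
    n = len(readings)
    mean = sum(readings) / n
    variance = sum((x - mean) ** 2 for x in readings) / (n - 1)
    return math.sqrt(variance / n)

def type_b_rectangular(half_width: float) -> float:
    """Type B: value judged to lie uniformly within plus or minus half_width."""
    return half_width / math.sqrt(3)

def combined(components: list) -> float:
    """Combined standard uncertainty, assuming uncorrelated inputs."""
    return math.sqrt(sum((c.sensitivity * c.std_uncertainty) ** 2 for c in components))

# Hypothetical budget for one item scored on a 0-10 scale.
readings = [7.1, 6.8, 7.4, 7.0, 6.7]  # repeated scores from independent judges
budget = [
    Component("judge-to-judge scatter", "A", type_a(readings)),
    Component("rubric ambiguity (+/- 0.5)", "B", type_b_rectangular(0.5)),
    Component("reference-set drift (+/- 0.2)", "B", type_b_rectangular(0.2)),
]

u_c = combined(budget)
expanded = 2 * u_c  # coverage factor k = 2, roughly 95% if the result is near-normal

for c in budget:
    print(f"{c.name:32s} type {c.kind}  u = {c.std_uncertainty:.3f}")
print(f"combined u_c = {u_c:.3f}, expanded U (k=2) = {expanded:.3f}")
```

In this toy budget the Type B rubric term is the largest contributor, which matches the point above that judgment-based components tend to dominate when the measurand is contested.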

Financial auditing is arguably the oldest industrialized evaluation-engineering discipline: centuries ago it had to formalize exactly this field’s problems. It defines assurance as reducing evaluation risk to an acceptably low level (not perfection); operationalizes “how much accuracy is worth buying” through materiality and the audit-risk model; solves “evaluate a population from a subset” with explicit sampling theory; and standardizes the object of evaluation via internal-control frameworks (COSO). Most importantly, it has a deep, self-aware literature on trust under adversarial incentives: DeAngelo’s reputation/quasi-rent theory of why large evaluators resist capture, Coffee’s “gatekeeper failure” theory of when reputational intermediaries stop deterring misconduct, and the canonical worked example of a captured evaluation system, namely credit-rating agencies under the issuer-pays model, whose failures the Financial Crisis Inquiry Report called “essential cogs in the wheel of financial destruction.” The recurring lesson: the payment/independence structure (the provenance of the evaluator’s incentives) dominates technical methodology in deciding whether a high-volume evaluation system stays trustworthy. Relates to Techniques (trust networks), Epistemic Culture (candidness), and the cost/accuracy frontier.

  • ISAE 3000 (Revised) — Assurance Engagements Other Than Audits/Reviews of Historical Financial Information — IAASB. link — The general-purpose engine for trusted third-party evaluation of arbitrary subject matter (ESG, controls, compliance); the closest existing analog to a generic “evaluation engineering” standard.
  • ISA 200 — Overall Objectives of the Independent Auditor — IAASB. PDF — Defines reasonable (not absolute) assurance, the audit-risk model (inherent × control × detection), materiality, skepticism, and independence as preconditions.
  • ISA 530 — Audit Sampling — IAASB. PDF — The standardized methodology for concluding about a whole population from a sample (sampling risk, sample size, monetary-unit sampling): the cost-vs-accuracy / “evaluate a subset” problem.
  • Internal Control – Integrated Framework (2013) — COSO. link — A case study in standardizing the thing being evaluated so many evaluators can assess it consistently and at known scope.
  • Auditor Size and Audit Quality — DeAngelo (1981), J. Accounting and Economics. link — Foundational theory of why evaluators stay honest: quality = probability of detecting and reporting a breach; client-specific quasi-rents give larger evaluators more reputational capital at stake.
  • Understanding Enron: “It’s About the Gatekeepers, Stupid” — Coffee (2002), Columbia Law. PDF — Defines reputational “gatekeepers” and models when they fail (when expected liability for acquiescence falls below the benefits). The model of when an evaluation system stops deterring misconduct.
  • Markets: The Credit Rating Agencies — White (2010), J. Economic Perspectives. link, with the Financial Crisis Inquiry Report (2011) PDF — How regulatory reliance plus issuer-pays seeded ratings inflation; the definitive account of a captured, load-bearing evaluation system.
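
The audit-risk model turns "how much accuracy is worth buying" into arithmetic: fix the acceptable overall risk, assess inherent and control risk, and solve for the detection risk the evaluator's own testing must achieve. The sketch below pairs that with a deliberately simplified zero-deviation sample-size bound from basic probability. That bound is an illustration of the cost consequence, not the procedure ISA 530 prescribes, and all numeric inputs are hypothetical.

```python
import math

def required_detection_risk(audit_risk: float, inherent_risk: float, control_risk: float) -> float:
    """Solve audit_risk = inherent_risk * control_risk * detection_risk for detection risk."""
    for name, value in (("audit_risk", audit_risk),
                        ("inherent_risk", inherent_risk),
                        ("control_risk", control_risk)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    return min(1.0, audit_risk / (inherent_risk * control_risk))

def zero_deviation_sample_size(detection_risk: float, tolerable_rate: float) -> int:
    """Smallest n such that a population with deviation rate >= tolerable_rate would show
    zero deviations in n independent draws with probability <= detection_risk.
    Simplifying assumption: large population, independent draws, no expected deviations."""
    return math.ceil(math.log(detection_risk) / math.log(1.0 - tolerable_rate))

# Hypothetical engagement: 5% acceptable overall risk, 5% tolerable deviation rate.
strong_controls = required_detection_risk(0.05, inherent_risk=0.8, control_risk=0.5)
weak_controls = required_detection_risk(0.05, inherent_risk=1.0, control_risk=1.0)

print(strong_controls, zero_deviation_sample_size(strong_controls, 0.05))
print(weak_controls, zero_deviation_sample_size(weak_controls, 0.05))
```

Under these assumptions, trustworthy upstream controls let the evaluator accept a detection risk of 0.125 and inspect 41 items, while no reliance on controls forces a detection risk of 0.05 and 59 items. The same trade applies when deciding how many outputs of an automated evaluator a human needs to spot-check.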

The formal vocabulary for the field’s central act: compressing complex, expensive reality into cheaper, decision-usable summaries while preserving value. Shannon established the currency — entropy (the information in a source) and channel capacity — and mutual information / KL divergence quantify how much one variable (an evaluation) tells you about another (the truth). The tightest analogy for an evaluation is rate-distortion theory: the minimum bits to describe a source while keeping distortion below D — exactly “how cheap can a summary be while preserving most of its value,” with the distortion measure standing in for what must be preserved. Minimum Description Length and Kolmogorov complexity extend this to model selection (compression-as-inference), and proper scoring rules tie back to information (log score = cross-entropy, so minimizing log loss minimizes KL to the truth). Bayesian expected information gain lets you value an experiment before running it. The crucial caveat (Howard): Shannon information is not decision value — an evaluation that sharply reduces uncertainty about an irrelevant variable has high entropy reduction but zero decision value. The field needs both lenses: information theory to measure and price compression, decision/VOI theory to ensure the bits preserved are the ones that change actions. This formalizes the wiki’s working definition of an evaluation as “a procedure that converts complex information into simpler information preserving most of the value” (see Estimation vs. Evaluation).

  • A Mathematical Theory of Communication — Shannon (1948), Bell System Technical Journal. PDF — The founding paper: entropy, the bit, source/channel coding, channel capacity. The unit in which any evaluation’s information content can be measured.
  • On Information and Sufficiency — Kullback & Leibler (1951), Annals of Mathematical Statistics. link — Introduces relative entropy (KL divergence): “how far is this evaluation’s estimate from the truth.”
  • Coding Theorems for a Discrete Source with a Fidelity Criterion — Shannon (1959). PDF — Founds rate-distortion theory: the most direct formal model of an evaluation as cheap-but-lossy compression that preserves value.
  • Modeling by Shortest Data Description (MDL) — Rissanen (1978), Automatica. link — Prefer the model that most compresses the data: choosing an evaluation procedure as choosing the shortest faithful description.
  • On a Measure of the Information Provided by an Experiment — Lindley (1956), Annals of Mathematical Statistics. link — Expected (Bayesian) information gain: the way to value an evaluation/experiment in advance.
  • Elements of Information Theory — Cover & Thomas (2nd ed., 2006), Wiley. link — The standard reference covering entropy, mutual information, capacity, rate-distortion, and Kolmogorov complexity in one place.
  • See also Howard, Information Value Theory (1966) in §2 — the essential corrective that an evaluation’s worth is the decisions it changes, not the entropy it removes.
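
Two claims from this section can be checked numerically with a small sketch. The first is that expected log loss decomposes into entropy plus KL divergence, so minimizing log loss minimizes KL to the truth. The second is Howard's caveat that bits removed and decision value can come apart. The distributions, the two-coin world, and the payoffs are hypothetical toy choices.

```python
import math
from itertools import product

def entropy(p) -> float:
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q) -> float:
    """Expected log loss (bits) of forecasting q when outcomes follow p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q) -> float:
    """Relative entropy D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

truth, forecast = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(cross_entropy(truth, forecast), entropy(truth) + kl(truth, forecast))

# Toy world: x is decision-relevant, z is not; both are independent fair coins.
joint = {(x, z): 0.25 for x, z in product([0, 1], repeat=2)}
payoff = {"go": {0: -1.0, 1: 1.0}, "stay": {0: 0.0, 1: 0.0}}  # depends on x only

def best_value(dist: dict) -> float:
    """Best expected payoff achievable under a distribution over (x, z)."""
    return max(sum(p * payoff[a][x] for (x, _), p in dist.items()) for a in payoff)

def posteriors(index: int):
    """Yield (probability, posterior) for each value of coordinate index (0 = x, 1 = z)."""
    for value in (0, 1):
        sub = {s: p for s, p in joint.items() if s[index] == value}
        mass = sum(sub.values())
        yield mass, {s: p / mass for s, p in sub.items()}

def bits_removed(index: int) -> float:
    """Expected entropy reduction from observing one coordinate."""
    after = sum(m * entropy(post.values()) for m, post in posteriors(index))
    return entropy(joint.values()) - after

def decision_value(index: int) -> float:
    """Expected payoff gain from observing one coordinate before acting."""
    return sum(m * best_value(post) for m, post in posteriors(index)) - best_value(joint)

print("observe x:", bits_removed(0), "bit,", decision_value(0), "payoff gain")
print("observe z:", bits_removed(1), "bit,", decision_value(1), "payoff gain")
```

Observing either coin removes exactly one bit, but only the relevant one changes the action and therefore has any decision value. This is why the information-theoretic and VOI lenses have to be used together.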

Scope note. Nine fields, ~55 entry points — deliberately a curated map, not a survey. Each section’s synthesis reflects a single literature sweep and should be treated as a starting orientation. Suggestions and corrections are welcome; candidate fields not yet covered include scientometrics/peer review, accounting measurement theory, reliability engineering, and survey methodology.