Evaluation Engineering

Highly optimized evaluations are (almost) all you need — treating the production of estimates and evaluations as a systems-engineering problem rather than a series of one-off studies.

Evaluation engineering is the study and practice of building systems that produce large numbers of estimates and evaluations — efficiently, consistently, and at a known cost. Its unit of analysis is not the single estimate (“how good was this grant?”) but the system that produces estimates by the thousand: who staffs it, what data it runs on, how it stays calibrated, how updates propagate, and how accuracy is traded against quantity and cost. Most existing evaluation is artisanal — bespoke reports written one at a time. The bet here is that the bottleneck has moved from how to evaluate one thing well to how to evaluate ten thousand things well enough, cheaply enough, that the results actually get used — and that this is an engineering problem with engineering answers.

The provocative version of the thesis: highly optimized evaluations are (almost) all you need. Many of the highest-stakes processes civilization runs on are evaluation problems wearing other clothes — courts, grantmaking, hiring, impact assessment, policy choice. Plausibly far more resources flow into evaluation than into clean estimation, which makes optimizing it correspondingly valuable. And the bar is lower than it looks: the outputs don’t need to be accurate in any absolute sense, only better than what people would otherwise have done — and cheap enough to actually get used. See Objections & FAQ for the case against.

This program was first sketched in 2021–22 (see Lineage) under the name advanced / symbolic evaluation systems. The argument then: prediction markets and forecasting tournaments had underdelivered not because prediction is useless, but because prediction is only one component of a larger architecture (ontology, calculation, prediction, evaluation), and nobody was building the whole machine. The missing piece was a cheap, general executor for the labor-intensive parts. Large language models are now a candidate for exactly that, which turns a speculative research agenda into a buildable one.

  • The systems view — why an evaluation system is a different object from an evaluation, and which trade-offs (accuracy × quantity × cost) define its design space. See Evaluation as a System.
  • The components — the four reusable parts that show up in every serious system: prediction, calculation, ontology, and evaluation. See The Four Components.
  • The methods and techniques — the catalogue of ways to actually produce evaluations (expert panels, surveys, statistical and composite measures) and the system-level techniques that tie cheap and expensive judgments together. See Evaluation Methods and Techniques.
  • The environment — the cultural and institutional conditions a system needs to survive contact with the world. See Epistemic Culture.
#    Page                            Status
1    Evaluation Engineering          draft
2    Estimation vs. Evaluation       draft
3    Why It Matters — Use Cases      draft
4    Cruxes                          draft
     Lineage                         draft
I    The Systems View
4    Evaluation as a System          draft
5    The Four Components             draft
     Evaluation Systems in the Wild  catalogue
     Patterns & Failure Modes        draft
II   Methods & Techniques
6    Evaluation Methods              draft
7    Techniques                      draft
III  The Environment
8    Epistemic Culture               draft
     Glossary                        draft
     Objections & FAQ                draft
     Related Work (QURI)             draft
     Adjacent Fields & Literature    draft
     Open Problems                   aggregated

Read it offline or feed it to an LLM. The whole wiki is also available as a single file — llms-full.txt — with every page concatenated in reading order.

This wiki is in an early, exploratory stage. Pages are working notes, not settled positions. It is a sibling to the Robust Reasoning Processes wiki and part of the CAIRN project by QURI.