Evaluation Engineering

Highly optimized evaluations are (almost) all you need. This wiki treats the production of estimates and evaluations as a systems-engineering problem rather than a series of one-off studies.

Start with Chapter 1: Why It Matters

The field in one paragraph

Evaluation engineering is the study and practice of building systems that produce large numbers of estimates and evaluations — efficiently, consistently, and at a known cost. Its unit of analysis is not the single estimate (“how good was this grant?”) but the system that produces estimates by the thousand: who staffs it, what data it runs on, how it stays calibrated, how updates propagate, and how accuracy is traded against quantity and cost. Most existing evaluation is artisanal — bespoke reports written one at a time. The bet here is that the bottleneck has moved from how to evaluate one thing well to how to evaluate ten thousand things well enough, cheaply enough, that the results actually get used — and that this is an engineering problem with engineering answers.

Why evaluation, specifically?

The provocative version of the thesis: highly optimized evaluations are (almost) all you need. Many of the highest-stakes processes civilization runs on are evaluation problems wearing other clothes — courts, grantmaking, hiring, impact assessment, policy choice. Plausibly far more resources flow into evaluation than into clean estimation, which makes optimizing it correspondingly valuable. And the bar is lower than it looks: the outputs don’t need to be accurate in any absolute sense, only better than what people would otherwise have done — and cheap enough to actually get used. See Objections & FAQ for the case against.
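The "lower bar" above can be read as a two-part test: beat the status quo, and stay cheap enough to be used. The following is a minimal sketch of that test, not a method taken from this wiki. The `Option` type, the `clears_the_bar` function, the choice of mean absolute error as the accuracy measure, and every number are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of producing an evaluation (all fields are illustrative)."""
    name: str
    mean_abs_error: float  # assumed error measure; lower is better
    cost_per_item: float   # assumed cost in arbitrary units


def clears_the_bar(system: Option, status_quo: Option,
                   max_cost_per_item: float) -> bool:
    """True if `system` beats what would otherwise be done AND is affordable.

    Note there is no absolute accuracy threshold: the comparison is only
    against the status quo and the budget.
    """
    better = system.mean_abs_error < status_quo.mean_abs_error
    affordable = system.cost_per_item <= max_cost_per_item
    return better and affordable


# Made-up numbers: a gut call, a bespoke report, and an automated pipeline.
gut_call = Option("gut call", mean_abs_error=0.30, cost_per_item=0.0)
bespoke = Option("bespoke report", mean_abs_error=0.10, cost_per_item=5000.0)
pipeline = Option("automated pipeline", mean_abs_error=0.20, cost_per_item=2.0)

pipeline_ok = clears_the_bar(pipeline, gut_call, max_cost_per_item=10.0)
bespoke_ok = clears_the_bar(bespoke, gut_call, max_cost_per_item=10.0)
```

Under these invented numbers the bespoke report is the most accurate option but fails the cost test, while the less accurate pipeline passes both. That is the sense in which the bar is relative rather than absolute.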

Why now

This program was first sketched in 2021–22 (see Lineage) under the name advanced / symbolic evaluation systems. The argument then: prediction markets and forecasting tournaments had underdelivered not because prediction is useless, but because prediction is one component of a larger architecture — ontology, calculation, forecasting, evaluation — and nobody was building the whole machine. The missing piece was a cheap, general executor for the labor-intensive parts. Large language models are now a candidate for exactly that, which turns a speculative research agenda into a buildable one.

The shape of the field

The systems view — why an evaluation system is a different object from an evaluation, and which trade-offs (accuracy × quantity × cost) define its design space. See Evaluation as a System.
The components — the four reusable parts that show up in every serious system: prediction, calculation, ontology, and evaluation. See The Four Components.
The methods and techniques — the catalogue of ways to actually produce evaluations (expert panels, surveys, statistical and composite measures) and the system-level techniques that tie cheap and expensive judgments together. See Evaluation Methods and Techniques.
The environment — the cultural and institutional conditions a system needs to survive contact with the world. See Epistemic Culture.
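As one illustration of what "tying cheap and expensive judgments together" might look like, the sketch below rescales many cheap scores using a small subset that also received an expensive judgment. This is an assumed technique chosen for illustration (an ordinary least-squares line fitted on the audited subset), not necessarily one of the techniques the wiki catalogues. The item names, scores, and the `fit_line` helper are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: every item has a cheap score,
# but only a few items also have an expensive (audited) score.
cheap_scores = {"a": 2.0, "b": 4.0, "c": 6.0, "d": 8.0, "e": 5.0}
expensive_scores = {"a": 1.0, "c": 3.0, "d": 4.0}

# Fit the cheap-to-expensive mapping on the audited subset only.
audited = sorted(expensive_scores)
slope, intercept = fit_line(
    [cheap_scores[k] for k in audited],
    [expensive_scores[k] for k in audited],
)

# Apply the mapping to every item, audited or not.
calibrated = {k: slope * v + intercept for k, v in cheap_scores.items()}
```

The design choice here is that the expensive judgments are spent on calibration rather than coverage: three audits adjust all five cheap scores. A real system would also need to check how well the fitted line holds outside the audited subset.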

The pages

#	Page	Status
1	Evaluation Engineering	draft
2	Estimation vs. Evaluation	draft
3	Why It Matters — Use Cases	draft
4	Cruxes	draft
—	Lineage	draft
I	The Systems View
4	Evaluation as a System	draft
5	The Four Components	draft
—	Evaluation Systems in the Wild	catalogue
—	Patterns & Failure Modes	draft
II	Methods & Techniques
6	Evaluation Methods	draft
7	Techniques	draft
III	The Environment
8	Epistemic Culture	draft
—	Glossary	draft
—	Objections & FAQ	draft
—	Related Work (QURI)	draft
—	Adjacent Fields & Literature	draft
—	Open Problems	aggregated

Read it offline or feed it to an LLM. The whole wiki is also available as a single file — llms-full.txt — with every page concatenated in reading order.

Status

This wiki is in an early, exploratory stage. Pages are working notes, not settled positions. It is a sibling to the Robust Reasoning Processes wiki and part of the CAIRN project by QURI.