Evaluation Engineering
The field in one paragraph
Evaluation engineering is the study and practice of building systems that produce large numbers of estimates and evaluations efficiently, consistently, and at a known cost. Its unit of analysis is not the single estimate (“how good was this grant?”) but the system that produces estimates by the thousand: who staffs it, what data it runs on, how it stays calibrated, how updates propagate, and how accuracy is traded against quantity and cost. Most existing evaluation is artisanal: bespoke reports written one at a time. The bet here is that the bottleneck has moved from how to evaluate one thing well to how to evaluate ten thousand things well enough, cheaply enough, that the results actually get used, and that this is an engineering problem with engineering answers.
Why evaluation, specifically?
The provocative version of the thesis: highly optimized evaluations are (almost) all you need. Many of the highest-stakes processes civilization runs on are evaluation problems wearing other clothes: courts, grantmaking, hiring, impact assessment, policy choice. Plausibly far more resources flow into evaluation than into clean estimation, which makes optimizing it correspondingly valuable. And the bar is lower than it looks: the outputs don’t need to be accurate in any absolute sense, only better than what people would otherwise have done, and cheap enough to actually get used. See Objections & FAQ for the case against.
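A toy budget comparison can make the "better than the counterfactual, and cheap enough" bar concrete. This is a minimal sketch, not a method from this wiki: the `Evaluator` type, the `average_error` helper, every cost and error figure, and the assumption that unevaluated items fall back to a baseline error are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluator:
    """A hypothetical evaluation method, described by its cost and typical error."""
    name: str
    cost_per_item: float   # cost to evaluate one item (arbitrary units)
    mean_abs_error: float  # average error of one evaluation (arbitrary units)


def average_error(evaluator: Evaluator, budget: float, n_items: int,
                  baseline_error: float) -> float:
    """Average error across all items when the budget covers only some of them.

    Assumption: items the budget cannot reach keep the baseline error,
    i.e. what people would otherwise have done without the system.
    """
    covered = min(n_items, int(budget // evaluator.cost_per_item))
    uncovered = n_items - covered
    total = covered * evaluator.mean_abs_error + uncovered * baseline_error
    return total / n_items


# Made-up numbers: an accurate but expensive method and a rough but cheap one.
artisanal = Evaluator("bespoke report", cost_per_item=1000.0, mean_abs_error=0.5)
cheap = Evaluator("cheap automated pass", cost_per_item=10.0, mean_abs_error=2.0)

for ev in (artisanal, cheap):
    err = average_error(ev, budget=100_000.0, n_items=10_000, baseline_error=3.0)
    print(f"{ev.name}: average error {err:.3f}")
```

With these invented figures the less accurate method ends up with the lower average error, because it reaches every item while the bespoke reports cover only a small fraction. Different costs, errors, or baselines can reverse the result, which is the point of treating the trade-off as a design question.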
Why now
This program was first sketched in 2021–22 (see Lineage) under the name advanced / symbolic evaluation systems. The argument then: prediction markets and forecasting tournaments had underdelivered not because prediction is useless, but because prediction is one component of a larger architecture (ontology, calculation, forecasting, evaluation), and nobody was building the whole machine. The missing piece was a cheap, general executor for the labor-intensive parts. Large language models are now a candidate for exactly that, which turns a speculative research agenda into a buildable one.
The shape of the field
- The systems view: why an evaluation system is a different object from an evaluation, and which trade-offs (accuracy × quantity × cost) define its design space. See Evaluation as a System.
- The components: the four reusable parts that show up in every serious system (prediction, calculation, ontology, and evaluation). See The Four Components.
- The methods and techniques: the catalogue of ways to actually produce evaluations (expert panels, surveys, statistical and composite measures) and the system-level techniques that tie cheap and expensive judgments together. See Evaluation Methods and Techniques.
- The environment: the cultural and institutional conditions a system needs to survive contact with the world. See Epistemic Culture.
The pages
| # | Page | Status |
|---|---|---|
| 1 | Evaluation Engineering | draft |
| 2 | Estimation vs. Evaluation | draft |
| 3 | Why It Matters — Use Cases | draft |
| 4 | Cruxes | draft |
| — | Lineage | draft |
| I | The Systems View | |
| 4 | Evaluation as a System | draft |
| 5 | The Four Components | draft |
| — | Evaluation Systems in the Wild | catalogue |
| — | Patterns & Failure Modes | draft |
| II | Methods & Techniques | |
| 6 | Evaluation Methods | draft |
| 7 | Techniques | draft |
| III | The Environment | |
| 8 | Epistemic Culture | draft |
| — | Glossary | draft |
| — | Objections & FAQ | draft |
| — | Related Work (QURI) | draft |
| — | Adjacent Fields & Literature | draft |
| — | Open Problems | aggregated |
Read it offline or feed it to an LLM. The whole wiki is also available as a single file — llms-full.txt — with every page concatenated in reading order.
Status
This wiki is in an early, exploratory stage. Pages are working notes, not settled positions. It is a sibling to the Robust Reasoning Processes wiki and part of the CAIRN project by QURI.