Estimation vs. Evaluation
Status: early draft, adapted from the 2021–22 estimation-theory notes (see Lineage).
This distinction is the load-bearing one. If you only take one idea from this wiki, take this one.
Estimation
Estimation is the calculation of specific numbers, usually under uncertainty. It is a superset of ordinary numeric calculation: summing itemized expenses is “estimation” with no uncertainty; a Fermi estimate is “estimation” with a lot.
The defining property: the estimator only has to be correct. They don’t need to worry about how the result is interpreted, who trusts them, or what the number does once released. The challenge is purely accuracy.
Examples:
- How many piano tuners are in Boston right now?
- How many total hours have been spent reading a particular blog post?
- How much do Americans spend on mechanical keyboards per year?
Estimation leans on logic, math, economics, and data engineering.
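A Fermi estimate such as the piano-tuner question can be sketched as a product of uncertain factors. The following is a minimal Python sketch under stated assumptions, not a method these notes prescribe: the name `fermi_estimate`, the choice of log-uniform sampling, and every numeric range are illustrative, and the ranges are placeholders rather than researched figures.

```python
import math
import random


def fermi_estimate(factors, n_samples=10_000, seed=0):
    """Multiply uncertain factors, each given as a (low, high) range.

    Assumption: each factor is sampled log-uniformly between its bounds,
    one common choice when only the order of magnitude is known.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    samples = []
    for _ in range(n_samples):
        product = 1.0
        for low, high in factors.values():
            product *= math.exp(rng.uniform(math.log(low), math.log(high)))
        samples.append(product)
    samples.sort()
    return {
        "p05": samples[int(0.05 * n_samples)],
        "median": samples[n_samples // 2],
        "p95": samples[int(0.95 * n_samples)],
    }


# Placeholder ranges for illustration only; none of these are real data.
piano_tuners = {
    "households": (200_000, 400_000),
    "pianos_per_household": (0.02, 0.10),
    "tunings_per_piano_per_year": (0.5, 2.0),
    "tuner_years_per_tuning": (1 / 1500, 1 / 500),
}

# No uncertainty: low == high, so the result is ordinary arithmetic.
itemized_expenses = {"unit_price": (12.5, 12.5), "units": (4, 4)}

if __name__ == "__main__":
    print(fermi_estimate(piano_tuners))
    print(fermi_estimate(itemized_expenses))
```

Reporting a range (5th percentile, median, 95th percentile) rather than a single number keeps the uncertainty visible. When every range collapses to a point, the same function reduces to plain calculation, which matches the idea that ordinary arithmetic is estimation with no uncertainty.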
Evaluation
Evaluation is similar in that it also produces a judgment, often numeric, but it applies to things that are messy: results that are difficult or impossible to verify or fully trust. Evaluations either avoid formal models or use them as one input among many (à la cluster thinking).
The defining property: here the effect on the audience matters. The number usually needs explanation, the explanation needs to be tailored to readers, and — crucially — the result is only useful if the relevant people trust it. An excellent evaluation nobody believes changes nothing.
Examples:
- On a scale of 0–100, how good a job did Barack Obama do as president?
- What is the probability that we live in a simulation?
- How much did organization X reduce existential risk from 2000 to 2020?
Evaluation leans on epistemology, sociology, survey methodology, and the “soft” sciences.
The distinction is a gradient, not a wall
There is no crisp line. Many real questions sit in between. A rough contrast:
| Estimation | Evaluation |
|---|---|
| Highly quantitative | Highly qualitative |
| Relies on equations/models | Relies on judgment and intuition |
| Easy for parties to agree on | Parties hold different underlying intuitions |
| Little trust in the estimator needed | Lots of trust in the evaluator needed |
| Terminology rarely contested | Terminology frequently contested |
| Minimal explanation | Often substantial explanation |
| Usually numeric | Numeric, grades, scales, or prose |
| Math, programming, data, economics | Economics, sociology, epistemology, mixed methods |
A useful intuition pump: which questions would you hand to a sharp quantitative analyst (estimation), and which would you want a team of trusted domain experts or strong generalists on (evaluation)?
Why separate them: divide and conquer
The payoff of the distinction is a design strategy, borrowed from the functional-programming idea of separating pure from impure code:
Handle as much as possible as estimation. Sequester the genuinely judgment-bound parts into a separate evaluation layer. Don’t let the messiness of one bleed into the cleanliness of the other.
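The pure/impure analogy can be made concrete with a small sketch. Everything here is hypothetical and chosen for illustration: the names `total_reading_hours`, `Judgment`, and `evaluate_post`, the `value_per_hour` weighting, and the sample numbers are assumptions, not part of any system described in these notes.

```python
from dataclasses import dataclass


# Estimation layer: a pure function. Same inputs, same output,
# and anyone can recheck the arithmetic.
def total_reading_hours(page_views: int, completion_rate: float,
                        minutes_per_read: float) -> float:
    return page_views * completion_rate * minutes_per_read / 60.0


# Evaluation layer: the judgment-bound part, carried as explicit data
# with a named evaluator and a rationale that readers can inspect.
@dataclass(frozen=True)
class Judgment:
    evaluator: str
    value_per_hour: float  # subjective weight, not a measured quantity
    rationale: str


def evaluate_post(hours: float, judgment: Judgment) -> dict:
    """Combine a clean estimate with a judgment without mixing the two."""
    return {
        "estimated_hours": hours,                  # estimation output
        "score": hours * judgment.value_per_hour,  # depends on the judgment
        "evaluator": judgment.evaluator,
        "rationale": judgment.rationale,
    }


if __name__ == "__main__":
    hours = total_reading_hours(page_views=1200, completion_rate=0.5,
                                minutes_per_read=6.0)
    verdict = evaluate_post(hours, Judgment(
        evaluator="reviewer_a",
        value_per_hour=2.0,
        rationale="Hypothetical: readers report acting on the post.",
    ))
    print(verdict)
```

Two reviewers who disagree about `value_per_hour` can still share the same `total_reading_hours` result, so the disagreement stays confined to the evaluation layer.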
Pushed further, you get three nested layers, ordered by how verifiable they are — evaluation on the outside, a verifiable core of data and pure math at the center:
flowchart TB
subgraph eval[Evaluation: judgment-bound, trust-dependent]
subgraph est[Estimation: models & calculation]
core[Data & pure math:<br/>verifiable]
end
end
- Verifiable — raw data, mathematical facts, proofs.
- Estimation — derived numbers from models and calculation.
- Evaluation — the irreducibly judgment-bound calls.
The heuristic: do as much work as possible in the deeper (more verifiable) layers, and keep the layers separate. Every claim you can demote from “evaluation” to “estimation,” and from “estimation” to “verifiable,” gets cheaper, more trustworthy, and easier to keep consistent at scale.
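One way to keep the layers separate in practice is to tag each claim with its layer and track what it depends on. The sketch below is an assumption-laden illustration: the `Layer` and `Claim` names are hypothetical, and the rule that a claim is only as verifiable as its least verifiable input is an assumed reading of the heuristic, not something these notes state.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Layer(IntEnum):
    VERIFIABLE = 0  # raw data, mathematical facts, proofs
    ESTIMATION = 1  # derived numbers from models and calculation
    EVALUATION = 2  # irreducibly judgment-bound calls


@dataclass
class Claim:
    name: str
    own_layer: Layer
    inputs: list = field(default_factory=list)  # upstream Claim objects

    def effective_layer(self) -> Layer:
        """Assumed rule: a claim is only as verifiable as its least
        verifiable input, so take the maximum layer over all inputs."""
        layers = [self.own_layer] + [c.effective_layer() for c in self.inputs]
        return max(layers)


# Hypothetical claims, for illustration only.
census = Claim("city population count", Layer.VERIFIABLE)
tuners = Claim("number of piano tuners", Layer.ESTIMATION, [census])
verdict = Claim("is the tuning market healthy?", Layer.EVALUATION, [tuners])
leaky = Claim("tuners needed next year", Layer.ESTIMATION, [verdict])

if __name__ == "__main__":
    for claim in (census, tuners, verdict, leaky):
        print(claim.name, "->", claim.effective_layer().name)
```

Here `leaky` is labeled as estimation but depends on an evaluation, so its effective layer is `EVALUATION`. Under the assumed rule, that is the kind of bleed between layers the heuristic warns against.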
Two notes on naming
- There is already an academic field called Evaluation (program evaluation, rooted in the social sciences). It overlaps with this usage but is centered on bespoke studies and long reports rather than high-throughput systems. We borrow lessons but reframe the scope.
- “Estimation” and “evaluation” are deliberately plain, unromantic words. The priority is honest categories that collide as little as possible with existing terminology, not memorable branding. Better names may come later.
Where this goes
The estimation/evaluation split is what makes the component architecture coherent: prediction and calculation mostly serve the estimation layer, evaluation methods serve the evaluation layer, and ontology organizes the questions both operate on. The system-level techniques, especially prediction–evaluation systems, are largely about cheaply bridging the two.