Objections & FAQ

Status: early draft, adapted from the founding posts and recorded external feedback (see Lineage).

The field is young and the framing is uncertain. This page collects the objections worth taking seriously, the responses currently on offer (often partial), and outside commentary.

“This is too abstract to be a real field or cause area”

Granted: the area is vague. There are no crisp boundaries between evaluation systems, forecasting, institutional decision-making, epistemics, and related terms. But vagueness doesn’t imply unimportance — if anything it has made the area more neglected than it would otherwise be. The operative test isn’t “can we define it cleanly” but “does this framing help us identify valuable concrete projects?” If good projects get started, the higher-level carving can be revised later. (If you can carve the space better, that itself is a contribution — see Cruxes.)

“Scaled evaluations can never be accurate enough”

A common reaction is that these proposals are AGI-complete, usually resting on the assumption that the outputs must be highly accurate. They need not be. The estimates and evaluations only have to be better than what people would have done otherwise. People already make enormous numbers of informal evaluations and judgments; these are typically noisy, overconfident, and inaccurate. The bar is to beat that baseline cheaply, not to be correct in absolute terms.
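One way to make the "beat the baseline cheaply" test concrete is the minimal sketch below. It is an illustration, not a method from the source. The Brier score as the accuracy measure, the `Evaluator` class, the `beats_baseline` helper, and every forecast and cost figure are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


def brier_score(forecasts: List[float], outcomes: List[int]) -> float:
    """Mean squared gap between probability forecasts and 0/1 outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


@dataclass
class Evaluator:
    """Hypothetical record of one way of producing judgments."""
    name: str
    forecasts: List[float]     # probability assigned to each question
    cost_per_question: float   # illustrative cost, arbitrary units


def beats_baseline(candidate: Evaluator, baseline: Evaluator, outcomes: List[int]) -> bool:
    """Relative test: more accurate than the baseline and no more expensive.

    Nothing here asks whether the candidate is accurate in absolute terms.
    """
    more_accurate = (
        brier_score(candidate.forecasts, outcomes)
        < brier_score(baseline.forecasts, outcomes)
    )
    no_more_expensive = candidate.cost_per_question <= baseline.cost_per_question
    return more_accurate and no_more_expensive


# Illustrative data only: five resolved yes/no questions.
OUTCOMES = [1, 0, 1, 1, 0]

# Overconfident informal judgment: near-certain answers, two of them badly wrong.
informal = Evaluator("informal judgment", [0.95, 0.90, 0.10, 0.99, 0.05], 50.0)

# Cheaper scaled evaluation: hedged and far from perfect, but never badly wrong.
scaled = Evaluator("scaled evaluation", [0.70, 0.40, 0.60, 0.70, 0.30], 5.0)

if __name__ == "__main__":
    for evaluator in (informal, scaled):
        score = brier_score(evaluator.forecasts, OUTCOMES)
        print(f"{evaluator.name}: Brier {score:.3f}, cost {evaluator.cost_per_question}")
    print("scaled beats informal:", beats_baseline(scaled, informal, OUTCOMES))
```

In this toy data the scaled evaluator is still visibly imperfect, yet it clears the bar because the comparison is against the informal baseline and not against a perfect oracle.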

“Why invent so much terminology instead of engaging the literature?”

The work spans many fields, and one has to stop somewhere. The most conspicuous gap is the established academic field of Evaluation (program evaluation): the bet is that the bulk of this cause area lies where that field has shown little interest, but that’s a bet, not a verdict. Recommendations and pointers are genuinely wanted. The decision-automation literature is also relevant and under-engaged.

“Why write theory instead of just building?”

The author has spent years building tools and running practical experiments in the space, and concluded that (a) his thinking was unusually hard to articulate and diverged from others’, and (b) hiring a large team to “just build it” is itself bottlenecked on having a clearly articulated vision. Theory first, then a return to direct work — not theory instead of it.

“Why make evaluation specifically such a big deal?”

This is the “evaluations are all you need” claim, and it’s load-bearing:

  • Prior forecasting discussion has under-weighted evaluation. You can get far with estimation alone, then hit a wall; foregrounding evaluation keeps it from being overlooked.
  • Evaluation may be the bulk of the desired output. Many of the highest-stakes existing processes are evaluation problems: courts, impact estimation, hiring decisions, grantmaking. It’s plausible that far more global resources go into evaluation than into estimation — which would make optimizing it correspondingly valuable.

The honest scale estimate: a challenge on the order of autonomous driving or ending aging — plausibly absorbing $100B over 20 years. The expectation is not that effective altruists fund most of it. Companies are already working in the space and will continue to; the high-leverage role is to figure out how such work can be made useful for altruistic purposes and to nudge the field in beneficial directions. The mental model is closer to clean meat than to AI alignment — shaping and accelerating a field that will largely happen anyway, rather than carrying it alone.

A note on capability grading: it would help to be able to grade evaluation systems the way autonomous driving has “Level 4.” Formalizing the inputs and outputs of estimation/evaluation work would let us draw historical trends and make projections — and give the field a shared yardstick. (See Evaluation as a System.)

The intended scale is not individuals or small side-projects. The model is full-time specialist teams (think hedge-fund analyst desks or data-science teams) producing outputs for large organizations and the public. Small-scale benefits are a welcome side effect, not the goal. The one area where small-group work is on the critical path is epistemic culture, which likely has to be tested in small communities first.

It helps to record outside reactions faithfully, if not word for word, especially critical ones. Notes from Mark Xu on an early version:

  • The most exciting part is the ability to pay for outcomes instead of processes — and possibly to outsource evaluation. But much of the current bottleneck is simply people capable of doing the evaluation work (e.g. grantmakers), and it’s unclear what concrete proposal solves that.
  • The overall sketch is “a bit too vague to have strong opinions about.” It seems clearly useful if done well, useless if done poorly.
  • Many things he’s wanted good estimates for turn out to be questions about how the economy actually works (e.g. “if all software engineers got 10% more productive, how much bigger does the economy get?”) more than about estimation/evaluation methodology per se.
  • Existing shallow evaluations (QURI’s, epistemic spot checks, the ALLFED shallow eval) have already seemed useful — possibly because the current state of evaluation is so poor that even shallow work helps. A generically useful move is “assumption unearthing”: what has to be true about the world for an organization’s claims to hold?
  • A skeptical magnitude estimate: even perfect evaluations might not increase the amount of useful work being done by more than ~2× — because the EA funding space has already funded the obviously-good things. (Doubling would still be very good.)
  • The single most wanted thing: a few specific, concrete (even fictional) case studies showing the value generated by better estimation and evaluation.

That last request is partly answerable today. A handful of real experiments and deployments now exist — the ~73% amplification result (2019), shallow org evaluations, pre-execution project-value prediction, and deployed tools (Squiggle AI, RoastMyPost) with published usage and failure data. They’re small, but they’re concrete. See Related Work for the full inventory. Turning them into crisp, persuasive case studies remains a standing gap; see Open Problems.