Open Problems

Status: aggregated. This collects the open questions scattered across the wiki into one place. Most are unresolved; many are barely scoped.

For the higher-level cruxes — the questions that would most redirect the field — see Cruxes. This page is the longer, more granular list, grouped by area.

Estimation vs. evaluation

How large a fraction of judgment-bound questions can actually be demoted to cheap, verifiable estimation? Where does the divide-and-conquer strategy top out?
Can an LLM-produced evaluation ever earn the trust that an expert panel’s does, or only match its content? Trust, not accuracy, is evaluation’s binding requirement.

The systems view

What’s the right unit and method for measuring an evaluation system’s accuracy, throughput, and cost — so that two systems can be compared?
How do you detect inconsistency across thousands of outputs automatically, rather than depending on someone happening to notice?
What infrastructure makes propagation (re-deriving downstream estimates when an input changes) cheap enough to be the default?
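To make the propagation question concrete, here is a minimal sketch of one possible mechanism: a dependency graph that invalidates cached downstream estimates when an input changes and recomputes them lazily on the next read. The class `EstimateGraph`, its method names, and the example quantities are all hypothetical and used only for illustration. The sketch assumes the dependencies form a directed acyclic graph.

```python
from typing import Callable, Dict, List, Tuple


class EstimateGraph:
    """Hypothetical container: raw inputs plus estimates derived from them."""

    def __init__(self) -> None:
        self.inputs: Dict[str, float] = {}
        self.funcs: Dict[str, Tuple[List[str], Callable[..., float]]] = {}
        self.cache: Dict[str, float] = {}
        self.dependents: Dict[str, List[str]] = {}  # name -> downstream names

    def set_input(self, name: str, value: float) -> None:
        """Set a raw input and drop every cached estimate that depends on it."""
        self.inputs[name] = value
        self._invalidate(name)

    def define(self, name: str, deps: List[str], fn: Callable[..., float]) -> None:
        """Register a derived estimate as a function of other named values."""
        self.funcs[name] = (deps, fn)
        for dep in deps:
            self.dependents.setdefault(dep, []).append(name)
        self.cache.pop(name, None)
        self._invalidate(name)

    def _invalidate(self, name: str) -> None:
        # Assumes an acyclic graph; a cycle would recurse forever.
        for child in self.dependents.get(name, []):
            self.cache.pop(child, None)
            self._invalidate(child)

    def get(self, name: str) -> float:
        """Return a value, recomputing only if no cached result exists."""
        if name in self.inputs:
            return self.inputs[name]
        if name not in self.cache:
            deps, fn = self.funcs[name]
            self.cache[name] = fn(*(self.get(d) for d in deps))
        return self.cache[name]


# Illustrative quantities, not taken from any real evaluation.
g = EstimateGraph()
g.set_input("cost_per_unit", 4.0)
g.set_input("units", 100.0)
g.define("total_cost", ["cost_per_unit", "units"], lambda c, u: c * u)
g.define("cost_with_overhead", ["total_cost"], lambda t: t * 1.5)

before = g.get("cost_with_overhead")
g.set_input("units", 200.0)  # changing an input invalidates both derived values
after = g.get("cost_with_overhead")
```

Lazy recomputation keeps an input change cheap when few downstream values are actually read. Whether that is cheap enough at the scale of thousands of interdependent estimates is the open question.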

Components

Ontology: is structuring the questions the real bottleneck, more than answering them? What tooling would make large structured question sets cheap to build and maintain?
Prediction: how far can aggregation and calibration be pushed when most “predictors” are cheap models rather than scored humans?
Calculation / estimation functions: what does the tooling (uncertainty, caching, composition, dependency tracking) need to look like for estimation functions to compose at scale?
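As one illustration of what composition with uncertainty could look like, the sketch below treats each estimate as a sampler and composes estimates by sampling through an ordinary function (plain Monte Carlo). The helpers `uniform`, `combine`, and `summarize`, and the example ranges, are hypothetical. This is a sketch under those assumptions, not a proposal for the actual tooling, and it leaves out caching and dependency tracking entirely.

```python
import random
import statistics
from typing import Callable, Dict

# An estimate is modelled as a function that draws one sample from an RNG.
Estimate = Callable[[random.Random], float]


def uniform(lo: float, hi: float) -> Estimate:
    """Estimate that is uniformly distributed between lo and hi."""
    return lambda rng: rng.uniform(lo, hi)


def combine(fn: Callable[..., float], *parts: Estimate) -> Estimate:
    """Compose estimates: sample each part, then apply fn to the samples."""
    return lambda rng: fn(*(p(rng) for p in parts))


def summarize(est: Estimate, n: int = 10_000, seed: int = 0) -> Dict[str, float]:
    """Draw n samples and report the mean and a rough 90% interval."""
    rng = random.Random(seed)
    draws = sorted(est(rng) for _ in range(n))
    return {
        "mean": statistics.fmean(draws),
        "p05": draws[int(0.05 * n)],
        "p95": draws[int(0.95 * n)],
    }


# Illustrative inputs only: hours of work and an hourly rate.
hours = uniform(10, 20)
rate = uniform(50, 100)
cost = combine(lambda h, r: h * r, hours, rate)

summary = summarize(cost)
```

Because `combine` returns another estimate, composed estimates can be fed into further compositions. The sketch treats every part as independent, which is one of the things real tooling would have to handle more carefully.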

Evaluation methods

Elicitation: how should questions be phrased for evaluators, and how can utility and value judgments be elicited cleanly?
Reliability: how reliable are evaluations in practice, and how do you keep them distinguishable from the forecasts that target them?
Pricing: how do you put a defensible cost (and value) on a messy, normative, long-horizon evaluation so it can be traded against accuracy?
Composite measures: how do you make sub-measure choice and weighting non-arbitrary, and robust under adversarial pressure?

Bridging cheap and expensive judgment

Do prediction–evaluation systems’ incentives survive gaming and deceptive participants?
What’s the optimal allocation of a fixed evaluation budget across a large question set?
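As a baseline for the budget-allocation question, the sketch below applies a simple greedy heuristic: fund evaluations in descending order of expected value per unit cost until the budget runs out. This is a standard knapsack-style heuristic and is not guaranteed to be optimal. The `Question` fields, the function name `greedy_allocate`, and the numbers are hypothetical, and the sketch assumes every cost is positive and that an expected value for each evaluation is already available, which is itself an open problem.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Question:
    name: str
    cost: float            # cost of a full evaluation (assumed > 0)
    expected_value: float  # assumed to be estimated elsewhere


def greedy_allocate(questions: List[Question], budget: float) -> Tuple[List[str], float]:
    """Pick questions by value-per-cost ratio; return chosen names and leftover budget."""
    chosen: List[str] = []
    remaining = budget
    ranked = sorted(questions, key=lambda q: q.expected_value / q.cost, reverse=True)
    for q in ranked:
        if q.cost <= remaining:  # skip anything that no longer fits
            chosen.append(q.name)
            remaining -= q.cost
    return chosen, remaining


# Illustrative numbers only.
questions = [
    Question("A", cost=5.0, expected_value=10.0),
    Question("B", cost=4.0, expected_value=6.0),
    Question("C", cost=3.0, expected_value=3.0),
    Question("D", cost=1.0, expected_value=3.0),
]
chosen, leftover = greedy_allocate(questions, budget=9.0)
```

A heuristic like this ignores interactions between questions (for example, one evaluation making another redundant) and partial funding through cheaper forecasts, both of which the open question would need to address.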

The environment

Is epistemic culture genuinely the binding constraint, and is it more tractable than the technical problems?
What rollout sequence lets a high-throughput public evaluation system deploy without being shut down or captured?
Do automated trust networks actually prevent capture, or just relocate it?

Demonstrating the value

Concrete case studies. The most-requested missing piece (per external commentary): a few specific, concrete — even fictional — case studies showing the value generated by better estimation and evaluation. The field is long on architecture and short on worked examples.
Is the bottleneck capable people, not tooling? Much current evaluation (e.g. grantmaking) is bottlenecked on people able to do the work. Does better tooling relieve that, or just relocate it? Can the work actually be outsourced?
How much does it really add? A skeptical estimate holds that even perfect evaluations might not increase useful work by more than ~2×, because the obviously-good things are already funded. Is that right, and does it change the case?

The field itself

Is “evaluation engineering” the right frame and name, or one more provisional label in the lineage?
Which domain should the first serious end-to-end system target, to learn the most per dollar?
A capability ladder. Can we define graded levels of evaluation-system capability (à la “Level 4” autonomy), and formalize inputs/outputs well enough to chart trends and project forward? See Evaluation as a System.

If you can sharpen, answer, or add to any of these, that’s the contribution this wiki most wants.