Evaluation Systems in the Wild

Status: early draft / curated catalogue, assembled from a June 2026 sweep. This is the descriptive companion to the conceptual pages: the world is already full of standing systems that produce many evaluations of a repeated type. They are the field’s natural experiments — its documented successes, failures, and capture stories. The founding program called for exactly this survey (the proto “process catalogue”).

How to read this

Each entry names what the system evaluates, its output format, its method, and, where one stands out, a notable weakness or failure. Before the catalogue, here are four cross-cutting lenses that the ~100 systems below keep illustrating:

Method archetypes. Independent lab testing · professional anonymous inspection · expert panel/committee · critic aggregation · crowd reviews (often Bayesian/credibility-weighted) · two-sided/reciprocal rating · statistical/algorithmic model · composite index · market-based · regulatory review. This is the menu from Evaluation Methods, seen in the wild.

Output formats. Stars (1–5), points (0–100), letter grades (A–F, AAA–D), rankings, pass/fail certification, tiers, probabilities, and dollar estimates. The format is an editorial choice with consequences — binary “fresh/rotten” discards intensity; 100-point wine scales compress to 88–100.

Funding determines capture risk — the single most predictive variable, echoing the audit/ratings literature and the trust-network and candidness discussions. Three archetypes:

Independent / nonprofit, buys its own units, no ads (Consumer Reports, Which?, Stiftung Warentest, IIHS) — strongest independence.
Affiliate / ad-funded editorial (Wirecutter, RTINGS, CNET) — recommendation-revenue conflict.
Rated party pays — issuer-pays ratings, fee-for-certification, award-licensing (Moody’s/S&P, UL, ISO, LEED, J.D. Power, DXOMARK) — highest capture exposure.

Recurring failure modes. Fake reviews / astroturfing / review bombing · gaming and Goodharting (citations, ratings) · capture (credit ratings in 2008; the World Bank’s Doing Business scandal) · pay-to-play · grade inflation · self-declaration abuse · snapshot-in-time validity · reciprocal retaliation in two-sided systems · opaque weighting.

Consumer products & testing

Consumer Reports — consumer products & services. 0–100 scores + “Recommended”/“Best Buy”. Independent lab testing + member reliability surveys; nonprofit, buys all units at retail, no ads. Weakness: affiliate-link revenue creates a perceived conflict.
Which? (UK) — products & services. Scores + “Best Buy”/“Don’t Buy”. Independent lab testing; nonprofit, no manufacturer money.
Stiftung Warentest (Germany) — products & services. German school-grade scale, printed on packaging. Undercover purchasing + outsourced scientific testing; ad-free, government-seeded foundation. Weakness: sued by manufacturers ~10×/year.
CHOICE (Australia) — products & services. Scores + “CHOICE Recommended”. Accredited in-house labs; nonprofit.
Wirecutter (NYT) — consumer gear. Narrative “top pick”/“budget pick”, no scores. Hands-on reviewer testing; affiliate revenue, no on-site ads.
RTINGS — TVs, monitors, headphones, etc. 0–100 overall + per-use-case scores. Standardized in-house bench measurements. Weakness: 2025 scoring overhaul drew backlash over weighting; 2026 paywall.
DXOMARK — camera/phone image, audio, display. Open-scale scores + sub-scores. Lab + structured perceptual testing. Weakness: core conflict — sells consulting to the firms it scores.
Tom’s Hardware / PCMag — PC components/devices. Stars + “Editors’ Choice” + hierarchy charts. Standardized benchmarking labs; affiliate + ads.
J.D. Power — vehicle quality/satisfaction. PP100 (problems per 100) + segment awards. Large owner surveys. Weakness: clients are the automakers; winners license the awards to advertise.
Kelley Blue Book / Edmunds — vehicle valuation & reviews. Dollar values (TMV / Blue Book Value). Statistical models on transaction data. Weakness: dealer-referral revenue; values can diverge from actual sales.
Robert Parker / Wine Advocate, Wine Spectator — wine. 100-point scale. Professional critics, often blind. Weakness: “Parker palate” homogenization; score compression into 88–100.
Untappd, BeerAdvocate, RateBeer — beer. Crowd star/score averages. Weakness: hype/novelty bias; RateBeer is owned by AB InBev (BeerAdvocate/Untappd by Next Glass) — big-brewer ownership of the rater.
Coffee Review — coffee. 100-point scale. Expert blind cupping. Weakness: pay-to-submit service; mostly 90+ published.
America’s Test Kitchen / Cook’s Illustrated — kitchen gear, ingredients, recipes. Tiered verdicts. Expert panels, blind taste tests, heavy repeated testing; no ads.

Media, entertainment & content

IMDb — films, TV, people. 1–10 Bayesian-weighted average; “Top 250”. Crowd votes. Weakness: vote brigading; demographic skew; polarized 1/10 voting.
Rotten Tomatoes — film/TV. % “fresh” critics + audience score. Binary critic aggregation (discards intensity). Weakness: review bombing of audience scores; binary loses nuance.
Metacritic — film/TV/games/music. 0–100 Metascore. Weighted critic average. Weakness: undisclosed weights; user-score review bombing.
OpenCritic — games. Top Critic Average (transparent, unweighted mean). Weakness: no user component; small samples for niche titles.
Steam user reviews — games. Positive/negative tiers, “Recent” vs “All-time”. Owner-gated binary. Weakness: protest review bombing.
Goodreads — books. 1–5 simple average. Crowd, minimal verification. Weakness: sockpuppet scandals; pre-publication bombing of unreleased books.
Letterboxd — film. 0.5–5 stars, weighted average. Cinephile crowd.
RateYourMusic — music. Credibility-weighted crowd charts. Weakness: opaque user-weighting; canon/obscurity skew.
MyAnimeList / AniList — anime/manga. Bayesian-weighted scores. Weakness: score inflation; seasonal brigading.
Billboard charts — songs/albums. Weekly ranking. Statistical blend of streams + sales + airplay. Weakness: bundling/stream-campaign manipulation; opaque weights.
Pitchfork — albums. Single critic 0.0–10.0. Weakness: single-reviewer subjectivity.
Nielsen — TV/streaming audience. Ratings/share. Panel + (since 2025) big-data hybrid. Weakness: panel sampling error for niche audiences; clients are the rated networks.
Common Sense Media — media for kids. Age (2–18) + 5-star quality. Expert reviewers on child-development criteria; nonprofit.
Age/content boards — MPA (G–NC-17, anonymous parent panel), ESRB (games), PEGI (games). Self-regulatory; rely on publisher disclosure (hidden content can slip).
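
Two aggregation methods recur in the entries above: the "Bayesian-weighted" average and the binary fresh/rotten tally. The sketch below is a minimal illustration, assuming the commonly cited shrinkage form (the title's mean pulled toward a site-wide mean by a fixed number of phantom votes). The names `prior_mean`, `prior_weight`, and `cutoff`, and all the numbers, are illustrative assumptions, not any site's actual parameters.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Shrink a title's mean rating toward a site-wide prior mean.

    prior_weight behaves like a number of phantom votes cast at prior_mean
    (hypothetical parameter; sites that use this idea rarely publish theirs).
    """
    votes = len(ratings)
    if votes == 0:
        return prior_mean
    mean = sum(ratings) / votes
    return (votes * mean + prior_weight * prior_mean) / (votes + prior_weight)


def percent_fresh(scores, cutoff=6.0):
    """Binary aggregation: percentage of reviews at or above a cutoff."""
    if not scores:
        raise ValueError("no reviews to aggregate")
    fresh = sum(1 for s in scores if s >= cutoff)
    return 100.0 * fresh / len(scores)


# Three perfect votes barely move a title off the site-wide mean of 7.0 ...
few_votes = bayesian_average([10, 10, 10], prior_mean=7.0, prior_weight=100)
# ... while 3,000 votes averaging 9 dominate the prior.
many_votes = bayesian_average([9] * 3000, prior_mean=7.0, prior_weight=100)

# Two films with very different critic enthusiasm get the same binary score.
lukewarm = percent_fresh([6.0, 6.5, 6.0, 6.5])
acclaimed = percent_fresh([9.5, 10.0, 9.0, 9.5])

print(round(few_votes, 2), round(many_votes, 2), lukewarm, acclaimed)
```

The shrinkage term is what blunts small brigades: a handful of extreme votes cannot move a title far. The binary tally shows the opposite trade-off, since it is easy to read but reports a mildly liked film and an acclaimed one identically.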

Finance, credit, insurance & risk

FICO / VantageScore — consumer credit. 300–850. Proprietary statistical model. Weakness: opacity; thin-file exclusion; entrenched gatekeeper.
Credit bureaus — Experian, Equifax, TransUnion. Full credit reports. Data aggregation. Weakness: common data errors hard to dispute; the 2017 Equifax breach (~147M people).
Bond/sovereign ratings — Moody’s, S&P Global, Fitch. AAA–D letter scales. Analyst committee + models, issuer-pays. Weakness: the canonical capture story: inflated AAA ratings on mortgage CDOs before 2008, followed by settlements of ~$1.375B (S&P, 2015) and ~$864M (Moody’s, 2017).
Morningstar — funds/stocks. 1–5 stars (quant, backward-looking), Medalist (forward-looking), Economic Moat. Weakness: star ratings weakly predict future performance; “star chasing”.
ESG ratings — MSCI (AAA–CCC), Sustainalytics (0–100 risk), S&P Global ESG. Weakness: ratings divergence — inter-rater correlation ~0.54 vs. ~0.92 for credit ratings (MIT “Aggregate Confusion”).
A.M. Best — insurer financial strength. A++–F. Insurance-specialist analysis; largely issuer-pays.
Credit-based insurance scores — LexisNexis, FICO. Risk scores for underwriting. Weakness: fairness/proxy-discrimination concerns; restricted or banned in several US states.
Zillow Zestimate — home value. Dollar estimate + range. ML automated valuation. Weakness: off-market median error ~7%; ignores condition; “not an appraisal”.
Cyber risk — BitSight (250–900), SecurityScorecard (A–F), CVSS (0–10, open standard). Weakness: external-only signals; CVSS severity routinely conflated with risk → “everything is Critical”.
Dun & Bradstreet PAYDEX — business payment reliability. 1–100. Vendor-reported trade data.
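
The ESG entry above turns on inter-rater correlation, so a small worked example may help. This is a sketch using a hand-written Pearson correlation on made-up rankings of five hypothetical firms. The data are invented to show the calculation and are not meant to reproduce any published figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rankings of the same five firms (1 = worst, 5 = best).
credit_rater_1 = [1, 2, 3, 4, 5]
credit_rater_2 = [1, 2, 3, 5, 4]   # one adjacent swap
esg_rater_1 = [1, 2, 3, 4, 5]
esg_rater_2 = [3, 1, 2, 5, 4]      # several disagreements

credit_r = pearson(credit_rater_1, credit_rater_2)
esg_r = pearson(esg_rater_1, esg_rater_2)
print(round(credit_r, 2), round(esg_r, 2))
```

A low correlation between raters means the rating tells a reader more about which rater was chosen than about the firm, which is the practical content of the divergence critique.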

Academia, science & education

Scholarly peer review — manuscripts. Accept/revise/reject. Expert review, mostly unpaid. Weakness: low inter-rater reliability (“lottery”); slow; weak fraud screening.
Journal Impact Factor (Clarivate) — journals. Citation ratio. Weakness: heavily gamed (coercive/self-citation, cartels); DORA condemns its use to judge individuals.
h-index — authors. Single integer (productivity × impact). Weakness: field-dependent; gameable via self-citation; can’t decrease.
Citation databases — Web of Science, Scopus (Elsevier — also a publisher), Google Scholar (widest, least curated).
Altmetric — online attention. Weighted “donut” score. Weakness: measures attention, not quality; gameable.
University rankings — QS, THE (reputation-survey heavy), ARWU/Shanghai (objective, prize-weighted), US News (self-reported data enabled the Columbia fraud; 2023 boycott), Leiden (bibliometric, deliberately no composite).
REF (UK Research Excellence Framework) — university research. 4*–1* profiles. Expert panel review; allocates ~£2B/yr. Weakness: very high administrative cost.
GRADE / Cochrane RoB 2 — evidence quality / trial bias. Tiered ratings. Structured expert rating. Weakness: domain judgments still subjective.
Standardized tests — SAT/ACT, GRE, PISA, TIMSS. Weakness: scores track family income; teaching-to-the-test.
School ratings — GreatSchools (1–10; historically correlated with race/affluence), Ofsted (England; replaced single-word grades with report cards in 2025 after criticism).
Accreditation — ABET (engineering/computing), AACSB (business schools), US institutional accreditors (Title IV gatekeepers). Weakness: peers accredit peers (conflict); slow on failing schools.
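
The h-index entry above is easy to make concrete. The sketch below implements the standard definition (the largest h such that h papers have at least h citations each). The citation counts are made up for illustration.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical author with five papers.
honest = [10, 8, 5, 4, 3]
# Same author after three self-citations aimed at the two weakest papers.
boosted = [10, 8, 5, 5, 5]
# Adding uncited papers never lowers the index.
padded = honest + [0, 0]

print(h_index(honest), h_index(boosted), h_index(padded))
```

Three well-placed citations lift the index from 4 to 5 here, which illustrates why the metric is considered gameable. Because citation counts only accumulate, the index also cannot fall over a career.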

Health, safety, standards & certification

Hospital ratings — CMS star ratings (1–5, federal), Leapfrog (A–F safety), US News Best Hospitals, Healthgrades. Weakness: CMS criticized for penalizing complex/teaching hospitals; Healthgrades sells ads to the hospitals it rates.
Restaurant hygiene — NYC letter grades (A/B/C), UK Food Hygiene Rating Scheme (0–5). Unannounced inspections; municipal, no fee-for-grade. Weakness: snapshot validity; inspection inconsistency.
Drug/device — FDA, EMA, NICE (cost-per-QALY HTA). Approve/not. Expert regulatory review. Weakness: user-fee funding criticized as “cozy”; QALY thresholds called arbitrary.
Crash tests — IIHS (Good–Poor + Top Safety Pick; insurer-funded, independent of makers), Euro NCAP (0–5 stars), NHTSA 5-Star (most cluster at 4–5★). Weakness: limited scenario set; “test to the test”.
Product safety — UL (lab testing + factory audits, fee-for-cert), CE marking (mostly self-declared). Weakness: UL cost barrier + counterfeit marks; CE self-declaration is gameable.
Energy — Energy Star (a 2010 GAO sting certified a gas-powered “alarm clock” → triggered third-party testing), EU energy label (A–G; rescaled 2021).
ISO 9001 certification — quality-management systems. Pass/fail + surveillance audits. Third-party audit, client pays the auditor. Weakness: “audit shopping”; certifies process not outcome.
LEED — green buildings. Certified–Platinum, points-based; fee-for-cert. Weakness: design- not performance-based — certified buildings don’t reliably use less energy.
B Corp — whole-company social/environmental. Pass/fail seal (≥80/200); fee-for-cert. Weakness: bar seen as low; Dr. Bronner’s dropped the cert in 2025 over multinational dilution.
Food/agriculture — USDA Organic, Fairtrade, MSC seafood (logo-royalty conflict), Rainforest Alliance. Weakness: royalty/fee models create incentives to certify generously.

Hospitality, travel & local business

Michelin Guide — restaurants/hotels. 1–3 stars. Professional anonymous inspectors, multiple visits. Weakness: tourism boards increasingly pay for regional entry (conflict); fine-dining/Eurocentric bias.
AAA Diamonds — N. American hotels/restaurants. 1–5 Diamonds. Anonymous inspectors; nonprofit.
Forbes Travel Guide — luxury. 4–5 Star. Inspectors on ~900 standards. Weakness: Forbes also sells training on how to earn its ratings.
Hotel star systems — accommodations. 1–5 stars. Hotelstars Union standardizes 21 European countries; the US has no government system (self-declared “5-star” is meaningless).
Yelp — local businesses. 1–5 stars + automated review filter. Weakness: long-running extortion / pay-to-play allegations.
TripAdvisor — travel. 1–5 bubbles. Crowd, no proof of stay. Weakness: a 2018 investigation alleged ~1 in 3 reviews fake; 200k+ AI-generated reviews removed in 2024.
Google Reviews — places. 1–5 stars + AI moderation. Weakness: ~240M fake reviews removed in 2024; extortion scams at scale.
Booking.com / Hotels.com — accommodations. Score /10, verified guests only, recency-weighted. Weakness: commission model is a structural conflict.
Trustpilot — businesses. TrustScore (Bayesian-weighted). Weakness: paying businesses get more tools (two-tier criticism).
BBB grades — business trustworthiness. A+–F. Composite + accreditation fees. Weakness: a 2010 sting got a fake company an A+ for ~$425 (pay-for-grade).
Glassdoor — employers. 1–5 stars, anonymous, “give to get”. Weakness: anonymity enables fakes; the rated employer pays the host.

Online platforms & reputation systems

eBay feedback — sellers. % positive + detailed star ratings. Transaction-linked. Weakness: extreme grade inflation; seller retaliation led eBay to bar negative buyer feedback.
Amazon reviews — products/sellers. 1–5 stars + Verified Purchase. ML-weighted crowd. Weakness: persistent fake/incentivized reviews; the FTC’s 2024 fake-review rule targets this.
Airbnb — hosts/guests. 1–5 stars, double-blind reveal, Superhost badge. Weakness: retaliation/extortion via review leverage; strong inflation (~4.8+ norm).
Uber / Lyft — drivers/riders. 1–5 reciprocal rolling average. Weakness: drivers deactivated below ~4.6; a 2020 suit alleged aggregating biased customer ratings is discriminatory.
DoorDash — Dashers. 1–5 (last 100) + completion %. Weakness: low deactivation thresholds; ratings reflect restaurant/app delays outside the driver’s control.
Stack Overflow reputation — Q&A expertise. Points + privilege tiers. Weakness: voting rings / sockpuppets (study).
GitHub stars — repo popularity. Integer count. Weakness: a fake-star economy — millions of bought stars, often promoting malware (study).
Reddit karma — contribution. Numeric. Weakness: karma farming via reposts/bots.
App store ratings — Apple (legacy ratings persist), Google Play (recency-weighted). Weakness: bought reviews + review bombing.
Wikipedia pending-changes / editor trust — editor trustworthiness. Permission flags + edit counts. Automated thresholds + admin grants. Weakness: edit count is a shallow, gameable proxy.
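
The gig-platform entries above combine a rolling window (DoorDash's last 100 ratings) with a deactivation threshold (roughly 4.6 in the Uber/Lyft entry). The sketch below shows how little slack that leaves. The class name `RollingRating` and the rating sequences are hypothetical, and no platform's actual logic is implied.

```python
from collections import deque


class RollingRating:
    """Average of the most recent `window` star ratings (1 to 5)."""

    def __init__(self, window=100):
        # deque with maxlen silently drops the oldest rating when full.
        self.ratings = deque(maxlen=window)

    def add(self, stars):
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        self.ratings.append(stars)

    def average(self):
        if not self.ratings:
            return None
        return sum(self.ratings) / len(self.ratings)


THRESHOLD = 4.6  # approximate figure quoted in the entry above

worker = RollingRating(window=100)
for stars in [5] * 89 + [1] * 11:
    worker.add(stars)
after_bad_run = worker.average()   # 11 one-star ratings out of 100

for _ in range(100):
    worker.add(5)
after_recovery = worker.average()  # old ratings have rolled out of the window

print(after_bad_run, after_bad_run < THRESHOLD, after_recovery)
```

With a five-star norm, eleven one-star ratings in a hundred are enough to cross the line, whether or not the delays behind them were the worker's fault. The window does forgive eventually, but only after a full hundred newer ratings.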

Sports & competition rankings

Elo / Glicko-2 — player skill. Numeric rating (Glicko-2 adds rating-deviation and volatility terms). Statistical update after each result, zero-sum in plain Elo. Weakness: Elo’s single K-factor models uncertainty crudely; pool-wide inflation.
FIDE — chess. Elo with tiered K-factors. Weakness: decades-long inflation debates.
ATP / WTA — tennis. Rolling 52-week points. Weakness: no opponent-strength weighting.
OWGR — golf. Strength-of-field-weighted points. Weakness: the LIV Golf exclusion controversy.
FIFA rankings — national football teams. Elo-based “SUM” model (since 2018, fixing the gameable old system).
Sabermetrics / WAR — baseball player value, in wins. Statistical composite. Weakness: the two main versions (bWAR, fWAR) disagree — “which WAR?”.
College Football Playoff committee, AP Poll, Coaches Poll — top-25 rankings. Expert/voter judgment. Weakness: opacity, reputation bias, and (Coaches Poll) direct conflicts of interest.
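
The Elo entries above rest on one short update rule. The sketch below uses the standard logistic expected-score formula with a 400-point scale. The K-factor of 20 and the ratings are illustrative choices, not any federation's exact settings.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=20):
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    Both players share the same K here, so points gained by one
    player are exactly the points lost by the other.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Evenly matched players: the winner takes 10 points from the loser.
even_a, even_b = elo_update(1500, 1500, 1.0)

# A 400-point underdog wins: a much larger transfer.
upset_a, upset_b = elo_update(1400, 1800, 1.0)

print(even_a, even_b, round(upset_a, 1), round(upset_b, 1))
```

The transfer is zero-sum only while both players use the same K. Once K differs by player tier, the pool's total rating can drift, which is one ingredient in the inflation debates mentioned above.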

Country & governance indices

Corruption Perceptions Index (Transparency International) — public-sector corruption. 0–100. Composite of expert/business surveys. Weakness: measures perceptions, not corruption.
Freedom in the World (Freedom House) — political rights/civil liberties. 0–100 + Free/Partly/Not Free. Expert assessment. Weakness: majority US-government funded (independence critique).
V-Dem — democracy (5 dimensions). 0–1 indices. ~3,500 expert coders → Bayesian IRT with explicit uncertainty bounds. Weakness: expert-coding subjectivity; complex to audit.
EIU Democracy Index — democracy. 0–10 + regime type. Weakness: opaque, proprietary, anonymous experts.
Human Development Index (UNDP) — health/education/income. 0–1. Geometric mean of three dimension indices. Weakness: only three crude dimensions; the equal weighting is arbitrary.
World Press Freedom Index (RSF) — press freedom. 0–100. Abuse tally + expert survey.
Worldwide Governance Indicators (World Bank) — six governance dimensions, with standard errors.
World Bank Doing Business (DISCONTINUED) — ease of doing business. Killed in September 2021 after audits found deliberate data manipulation favoring certain countries under leadership pressure — the cleanest documented case of index capture.
Gallup World Poll / World Happiness Report — wellbeing. Survey means (Cantril Ladder 0–10). Large-N self-report (not expert perception). Weakness: translation/cultural bias; over-reading a single question.
Also: Global Peace Index, WJP Rule of Law Index, Environmental Performance Index, and the ideologically-framed economic-freedom indices (Heritage, Fraser).
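
The HDI entry above aggregates by geometric mean, and the choice matters more than it may appear. The sketch below compares geometric and arithmetic means on two hypothetical countries whose dimension indices are invented for illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, as used to combine 0-1 indices."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.prod(values) ** (1.0 / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# Hypothetical (health, education, income) dimension indices.
balanced = [0.7, 0.7, 0.7]
uneven = [0.9, 0.9, 0.3]

print(round(arithmetic_mean(balanced), 3), round(arithmetic_mean(uneven), 3))
print(round(geometric_mean(balanced), 3), round(geometric_mean(uneven), 3))
```

Both countries have the same arithmetic mean, but the geometric mean scores the uneven one lower, so a strong dimension cannot fully compensate for a weak one. The aggregation formula is therefore a value judgment, just as the weights are.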

Charity & nonprofit evaluation

GiveWell — global health/development charities. Short “Top Charities” list + cost-per-life-saved estimates. Deep in-house CEA, publishes full models. Weakness: very narrow, evidence-rich cause focus.
Charity Navigator — US 501(c)(3)s. 0–4 stars / 0–100. Largely automated from Form 990s + impact “beacons”. Weakness: historic overhead-ratio reliance is a poor, gameable impact proxy.
Candid / GuideStar — nonprofit profiles. Bronze–Platinum transparency seals. Self-reported data. Weakness: seals measure disclosure, not effectiveness.
Animal Charity Evaluators, Founders Pledge — impact-focused evaluation in harder-to-measure causes. ImpactMatters (cost-per-impact) was folded into Charity Navigator (2020).
Also: CharityWatch (A+–F), BBB Wise Giving / Give.org (pass/fail accreditation), Giving What We Can (meta-evaluation of evaluators).

Forecasting & prediction platforms

Metaculus — many event types. Community-prediction probability; forecasters scored by proper rules. Crowd aggregation, no betting. Weakness: aggregate accuracy largely self-reported; no monetary incentive.
Good Judgment / GJ Open — geopolitics/economics. Probabilities scored by Brier; curated Superforecasters. Weakness: premium forecasts paywalled; small expert panel.
Polymarket — real-world events. Market price = probability; real-money crypto. Weakness: past US legal issues; thin-market manipulation.
Kalshi — US event contracts. Binary $0–$1; CFTC-regulated exchange. Weakness: much volume is sports, not forecasting.
Manifold — user-created markets. Play-money market maker. Weakness: play money weakens incentives; creator-resolved miscalibration.
PredictIt — US politics. Real-money, academic project. Weakness: position/withdrawal caps distort prices.
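
Several platforms above score forecasters with proper scoring rules such as the Brier score. The sketch below uses the binary form (mean squared error between stated probability and the 0/1 outcome, lower is better). Other variants exist, and the forecasts and outcomes here are made up.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def expected_brier(reported, true_prob):
    """Expected score for one question if the event occurs with true_prob."""
    return true_prob * (reported - 1) ** 2 + (1 - true_prob) * reported ** 2


outcomes = [1, 0, 1, 1]
calibrated = brier_score([0.9, 0.1, 0.8, 0.7], outcomes)
hedger = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)
overconfident = brier_score([1.0, 1.0, 0.0, 1.0], outcomes)

# "Proper" means honesty is optimal: if you believe 0.7, reporting 0.7
# gives a better expected score than exaggerating to 0.9.
honest = expected_brier(0.7, true_prob=0.7)
exaggerated = expected_brier(0.9, true_prob=0.7)

print(calibrated, hedger, overconfident, honest, exaggerated)
```

The last comparison is the property that lets a platform without money at stake still reward candid probabilities: under this rule, shading a forecast away from one's real belief can only hurt the expected score.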

Lessons for evaluation engineering

The five patterns below are the headlines; Patterns & Failure Modes develops each one rigorously, with the academic literature (reactivity, Goodhart’s law, certification economics, reputation inflation, aggregation theory) behind it.

Patterns the catalogue makes hard to ignore:

Funding structure predicts trustworthiness better than methodology does. The most-trusted systems (Consumer Reports, Which?, Stiftung Warentest, IIHS) share a model — independent, buys its own units, refuses ads — not a method. The clearest failures (2008 credit ratings, Doing Business, BBB pay-for-grade) are capture stories, not technique stories. This is the trust-network and candidness problem in the wild, and it matches the audit/ratings literature.
Every output format is gameable, differently. Binary fresh/rotten invites bombing; 100-point scales inflate and compress; reciprocal two-sided ratings breed retaliation and inflation; self-declared certifications get faked. Choosing the output format is choosing your failure mode.
Crowd systems converge on the same arms race — fakes, astroturfing, review bombing — and the same defenses: purchase/stay verification, Bayesian/credibility weighting, recency weighting, and ML fraud detection. Verification of who is evaluating is the recurring fix.
Composite indices live or die on weighting, which is inherently contested (HDI’s three dimensions, ESG divergence, ranking methodology churn). The OECD composite-indicators handbook exists precisely because this is hard.
“Shallow but standardized” often beats “deep but bespoke” at scale — letter-grade hygiene inspections, 5-star crash tests, and star ratings change behavior precisely because they are cheap, comparable, and ubiquitous. That is the systems view’s accuracy × quantity × cost trade-off, already made by society many times over.
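
The weighting pattern above can be shown in a few lines. The sketch below builds a weighted-average composite for three hypothetical countries and ranks them under two weightings. All names, scores, and weights are invented for illustration.

```python
def composite(scores, weights):
    """Weighted average of indicator scores; weights are normalized."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total


def rank(entities, weights):
    """Entity names ordered from highest to lowest composite score."""
    return sorted(entities, key=lambda e: composite(entities[e], weights),
                  reverse=True)


# Hypothetical indicator scores on a 0-1 scale.
countries = {
    "A": {"health": 0.90, "education": 0.50, "income": 0.60},
    "B": {"health": 0.60, "education": 0.80, "income": 0.70},
    "C": {"health": 0.75, "education": 0.60, "income": 0.80},
}

equal_weights = {"health": 1, "education": 1, "income": 1}
health_heavy = {"health": 0.6, "education": 0.2, "income": 0.2}

print(rank(countries, equal_weights))
print(rank(countries, health_heavy))
```

The underlying data never change, yet country A moves from last to first when the weights shift. Any published ranking therefore embeds a weighting argument, whether or not it is stated.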

Meta-lists & further reading

Curated catalogues of evaluation systems (the “good lists” that already exist):

Wikipedia — List of international rankings — the best single index of country rankings by domain.
Wikipedia categories — Review websites, International rankings, Credit rating agencies, Certification marks.
Wikipedia overviews — Review aggregator, Reputation system, List of academic databases and search engines, Sustainability standards and certification, List of freedom indices.
Ecolabel Index — a directory of ~450+ ecolabels across ~200 countries.
Academic — Davis, Kingsbury & Merry, Governance by Indicators (Oxford, 2012) — the scholarly catalogue + critique of global indicators; Jøsang et al., A survey of trust and reputation systems (2007); Tadelis, Reputation and Feedback Systems in Online Platform Markets (2016).

See also Adjacent Fields & Literature for the academic disciplines behind these systems, and Related Work for QURI’s own evaluation tools.