Cheatsheet — Golden Set

Source: Validation Report


What is a Golden Set?

A Golden Set is a fixed collection of input-output pairs with known, correct answers. It is the benchmark for measuring the quality of your AI system.
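As a concrete illustration, a golden set can be as simple as a list of immutable records. This is a minimal sketch; the `GoldenItem` type, its field names, and the sample contents are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: items cannot be mutated once the set is locked
class GoldenItem:
    query: str        # real or synthetic user input
    expected: str     # expert-approved correct output
    use_case: str     # category label, e.g. "account"
    difficulty: str   # "easy" or "edge"

# A tiny illustrative golden set (contents are invented for the example)
GOLDEN_SET = [
    GoldenItem("How do I reset my password?",
               "Go to Settings > Security > Reset.", "account", "easy"),
    GoldenItem("pssword reset not workinh!!",
               "Go to Settings > Security > Reset.", "account", "edge"),
]
```

In practice each item also carries an ID and rater metadata, but the core idea is fixed input-output pairs with known answers.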


Minimum Composition

| Criterion | Minimum value | Recommended |
|---|---|---|
| Number of examples | 50 | 200+ |
| Use case coverage | 80% | 100% |
| Edge cases | 10% of set | 20% |
| Raters per item | 1 | 2–3 (inter-rater) |
| Update frequency | On model change | Quarterly |
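The minimum composition criteria can be checked automatically before an evaluation run. A minimal sketch, assuming items are dicts with hypothetical `difficulty` and `use_case` keys:

```python
def check_composition(items, known_use_cases,
                      min_items=50, min_edge_frac=0.10, min_coverage=0.80):
    """Return a list of composition problems; an empty list means the set passes."""
    problems = []
    if len(items) < min_items:
        problems.append(f"only {len(items)} items (minimum {min_items})")
    if items:
        edge = sum(1 for it in items if it["difficulty"] == "edge")
        if edge / len(items) < min_edge_frac:
            problems.append(f"edge cases {edge / len(items):.0%} < {min_edge_frac:.0%}")
    if known_use_cases:
        covered = {it["use_case"] for it in items}
        coverage = len(covered & set(known_use_cases)) / len(known_use_cases)
        if coverage < min_coverage:
            problems.append(f"use case coverage {coverage:.0%} < {min_coverage:.0%}")
    return problems
```

Running this as a CI gate keeps the set from silently drifting below the minimums in the table.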

Build in 4 Steps

1. Collect real user queries (or synthetic queries if no real data is available)
2. Have domain experts establish correct outputs
3. Categorise by use case + difficulty level
4. Lock the set — modify only via formal process
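Step 4, locking the set, can be enforced with a content hash: record the fingerprint at lock time and verify it before each evaluation run, so any modification outside the formal process is detected. A minimal sketch (the `fingerprint` helper is an assumption, not a prescribed tool):

```python
import hashlib
import json

def fingerprint(items):
    """Deterministic SHA-256 hash of the golden set's content.

    Any edit to any item changes the fingerprint, so unauthorised
    modifications are detectable at evaluation time.
    """
    blob = json.dumps(items, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

The formal change process then amounts to: review the change, re-run expert validation on the affected items, and commit the new fingerprint alongside the set.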

Quality Thresholds

| Metric | Threshold (Go) | Action on failure |
|---|---|---|
| Accuracy (classification) | ≥ 85% | Retrain or optimise prompts |
| F1-score | ≥ 0.80 | Check class imbalance |
| Human rating | ≥ 4.0/5.0 | Review prompt design |
| Hallucination rate | ≤ 5% | Improve RAG quality |
| Latency p95 (95% of requests complete faster than this) | ≤ [budget] ms | Consider model tiering |
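The thresholds above translate directly into a go/no-go gate. A minimal sketch, assuming measured results arrive as a dict with the hypothetical keys shown; the latency budget is passed in because the source leaves it as a per-project value:

```python
def gate(metrics, latency_budget_ms):
    """Compare measured metrics against the Go thresholds; return (go, failures)."""
    checks = {
        "accuracy":      metrics["accuracy"] >= 0.85,
        "f1":            metrics["f1"] >= 0.80,
        "human_rating":  metrics["human_rating"] >= 4.0,
        "hallucination": metrics["hallucination_rate"] <= 0.05,
        "latency_p95":   metrics["latency_p95_ms"] <= latency_budget_ms,
    }
    failures = [name for name, ok in checks.items() if not ok]
    return len(failures) == 0, failures
```

Each failed check maps to the corresponding action in the table (e.g. a failed `hallucination` check points to improving RAG quality).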

Pitfalls

Avoid these mistakes

  • Using the Golden Set as training data (contamination)
  • Not updating the set after domain changes (concept drift)
  • Including only happy-path cases (no edge cases)
  • Single rater per item (no inter-rater agreement)
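The first pitfall, contamination, is cheap to screen for. A minimal sketch using exact (case- and whitespace-insensitive) matching; real pipelines often add near-duplicate detection on top, which this example deliberately omits:

```python
def contaminated(golden_queries, training_queries):
    """Return golden queries that also appear in the training data.

    Exact match after normalisation only; fuzzy or embedding-based
    near-duplicate checks are a common, stricter extension.
    """
    train = {q.strip().lower() for q in training_queries}
    return [q for q in golden_queries if q.strip().lower() in train]
```

Any non-empty result means the benchmark can no longer measure generalisation for those items and they should be replaced.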

Source for full approach: Validation report template