1. Evidence Standards¶
Purpose
Definition of minimum evidence standards so that Gate Reviews are based on verifiable criteria rather than intuition.
When to use this?
You are preparing a Gate Review and want to know what evidence you need to collect for your project's risk level and collaboration mode.
1. Objective¶
This module defines minimum evidence standards for AI solutions, so that Gate Reviews are based on verifiable criteria rather than intuition.
The evidence for an AI system consists of a coherent set of documents and log data that together provide insight into: what the system was supposed to do, how its behaviour was steered, how it was tested and what happened in practice. This coherence enables assessment, auditing and incident analysis.
Core principle: An AI solution may only proceed to the next phase when the evidence meets the standards for the chosen risk level (see Risk Management & Compliance) and Collaboration Mode (see AI Collaboration Modes).
2. Scope (what does this apply to?)¶
These standards apply to:
- Generative AI (text/image/advice)
- AI performing classification/extraction
- AI supporting decisions (advisory) or executing them (agent/action)
Not intended for:
- Pure BI reporting without AI decision-making
- Simple rules/automation without a model
3. Definitions (to make terms verifiable)¶
Error Classification¶
- Critical: violation of Hard Boundaries (privacy breach, prohibited advice, discriminatory output, dangerous instructions, misleading transparency). Norm: 0 permitted.
- Major: substantively incorrect with a real risk of harm or wrong decision. Norm: very limited (see table).
- Minor: style/format/minor incompleteness without decision impact.
"Significant Performance Degradation"¶
Performance degradation is significant if any of the following occurs relative to the baseline:
- Factual accuracy drops ≥ 2 percentage points (e.g. from 99% to 97%)
- Relevance score drops ≥ 0.3 on a 1–5 scale
- Number of Major errors increases ≥ 50% over two consecutive measurement periods
(Note: precise thresholds may be stricter per use case, but not more lenient without explicit approval from the Guardian.)
4. Required evidence (evidence pack)¶
Each Gate Review is based at minimum on these documents:
- Golden Set Test & Acceptance Protocol (the approach)
- Validation Report (the results + conclusion)
- Technical Model Card (what is actually running)
- Goal Definition (what it was supposed to do + Hard Boundaries)
- Risk Pre-Scan (risk class)
5. Minimum requirements for test sets ("Golden Set")¶
| Risk Level | Minimum Golden Set size | Required components |
|---|---|---|
| Minimal | 20 cases | 80% standard cases + 20% edge cases |
| Limited | 50 cases | 80% standard + 15% complex + 5% adversarial |
| High | 150 cases | 70% standard + 20% complex + 10% adversarial + fairness set |
Additional rules (all levels):
- Test cases are realistic real-world examples (not synthetic "happy flow only").
- Each test case has: expected outcome or assessment criteria.
- Adversarial set explicitly includes: jailbreaks, prompt injection, policy circumvention, "invent a source" tricks.
- Synthetic Data Generation: To reduce the workload of 150+ test cases, a "red-teaming AI" may be used to generate draft test cases. Requirement: A human expert must validate and approve each generated test case and the "expected answer" (Ground Truth) before inclusion in the Golden Set.
6. Measurement criteria and minimum standards (per risk level)¶
If your use case has no "accuracy" (e.g. generative text), use "Factual accuracy", "Completeness" and "Relevance" as primary measures.
Standards Table¶
| Criterion | Minimal risk | Limited risk | High risk |
|---|---|---|---|
| Critical errors | 0 | 0 | 0 |
| Major errors (max) | ≤ 2 in test set | ≤ 1 in test set | ≤ 0–1 in test set (Guardian decides) |
| Factual accuracy (no factual inaccuracies) | ≥ 98% | ≥ 99% | ≥ 99.5% |
| Relevance (1–5) | ≥ 4.0 | ≥ 4.2 | ≥ 4.5 |
| Safety: "must refuse" prompts | 100% rejection | 100% rejection | 100% rejection |
| Transparency (AI disclaimer where required) | n/a or 100% if external | 100% where applicable | 100% where applicable |
| Fairness check (bias) | qualitative (Guardian) | qual + quant where possible | required quant + mitigation plan |
| Audit trail (logging completeness) | minimal metadata | 100% metadata + output sampling | 100% input/output + traceable context |
| Stability (variation across runs) | monitor | limited variation permitted | strict: variation must be explained/acceptable |
Fairness (bias) — minimum norm (brief and verifiable)¶
- Limited: if relevant groups can be distinguished, then: difference in Major error rate between groups ≤ 10%.
- High: difference in Major error rate between groups ≤ 5%, plus described mitigation where deviations exist.
(If group labels are absent or privacy-sensitive: Guardian determines a qualitative check + mitigation.)
7. Logging requirements (audit trail)¶
What do we log at minimum?¶
- Date/time, user/role (hashed ID where required)
- Use case / endpoint
- Model name + version
- Prompt/Steering Instructions version
- Sources used (for Knowledge coupling: document IDs/URLs)
- Output
- Human override (yes/no + reason)
Retention (baseline)¶
- Minimal/Limited: standard 90 days, unless otherwise required.
- High risk: standard 12 months (or longer if legally required).
(Align with privacy policy; pseudonymise where possible.)
8. Evidence per Gate (practical)¶
- Gate 1 (Go/No-Go Discovery) (to Evidence): 09.01 + 09.02 (draft) + 09.03 + Data Evaluation completed.
- Gate 2 (PoV Investment) (to Development): 09.06 (pilot results) + 09.04 (draft) + Guardian approval on Hard Boundaries.
- Gate 3 (Production-Ready) (to Go-live/Delivery): 09.06 (release candidate) meets standards from §6 + logging plan + incident procedure.
- Gate 4 (Go-live) (to Management): baseline recorded + monitoring/feedback loop set up.