Skip to content

1. Test Frameworks

Purpose

Testing approach for AI systems that combines deterministic tests with evaluation of probabilistic behaviour.

1. Purpose

This module defines how we test AI systems. Unlike traditional software, AI requires a combination of deterministic tests and evaluation of probabilistic behaviour.


2. Test Levels

Component Tests (Unit Tests)

Testing individual components in isolation.

What we test:

  • Data transformation functions (input → expected output)
  • Prompt parsing and formatting
  • API integration code (with mocks)
  • Error handling (edge cases)

Characteristics:

  • Fast to execute (seconds)
  • Deterministic (same input = same result)
  • Automatic at every code change

Integration Tests

Testing the cooperation between components.

What we test:

  • End-to-end flow from input to output
  • Integration with external systems (databases, APIs)
  • Data validation in the full pipeline

Characteristics:

  • Slower than unit tests (minutes)
  • May require external dependencies
  • Periodic or at important changes

AI Behaviour Tests (Golden Set)

Testing AI behaviour on representative scenarios.

What we test:

  • Factuality and relevance of answers
  • Compliance with Hard Boundaries
  • Consistency over multiple runs
  • Performance per user group (fairness)

Characteristics:

  • Requires human assessment or automated evaluation
  • Variation possible due to probabilistic nature
  • Mandatory for every Gate Review

3. The Golden Set

The Golden Set is the central test set for AI behaviour. See Evidence Standards for minimum requirements per risk level.

Composition

Category Description Minimum %
Standard cases Typical, realistic scenarios 70-80%
Complex cases Edge cases, multi-step questions 15-20%
Adversarial cases Jailbreaks, prompt injection, policy bypassing 5-10%
Fairness cases Scenarios per relevant user group As needed

Format per Test Case

Field Description
ID Unique identification (e.g. GS-001)
Category Standard / Complex / Adversarial / Fairness
Input The exact prompt or question
Expected outcome Correct answer or assessment criteria
Assessment method Exact match / Keywords / Human assessment
Critical? Yes/No (Critical error if incorrect?)

Maintenance

  • Golden Set is periodically reviewed (minimum per release)
  • New scenarios are added at incidents or new functionality
  • Outdated cases are removed or updated

4. Adversarial Testing

Specific tests to validate safety and robustness.

Required Adversarial Scenarios

Scenario Description Expected Behaviour
Jailbreak Attempt to ignore instructions Refusal
Prompt injection Hidden instructions in user input Ignore instruction
Policy bypass Cleverly circumventing Hard Boundaries Refusal
Source fabrication "Make up a source" or "pretend" Refusal
PII extraction Attempt to retrieve training data Refusal
Tool abuse / privilege escalation Attempt to obtain higher rights or perform unauthorised actions via tools Refusal + logging
Data exfiltration via tool output Attempt to extract sensitive data via tool responses or artefacts Blocking + alert
Retrieval poisoning Injection of malicious sources into knowledge base to manipulate output Detection (monitoring) + blocking/refusal + logging
Action injection Manipulation of tool schemas to trigger unintended actions Schema validation + refusal

Sources: [so-1], [so-10]

Execution

  • Minimal Risk: Qualitative sampling by Guardian
  • Limited Risk: Structured adversarial set (minimum 5% of Golden Set)
  • High Risk: Extended adversarial testing + external red team where relevant

5. Regression Testing

Automatically repeating tests at changes to detect degradation.

What Triggers Regression Tests?

Change Regression test level
Code change Component tests + Integration tests
Prompt change Integration tests + Golden Set sample
Model version update Full Golden Set
Data source change Full Golden Set + Fairness

Automation

Level Approach Tooling examples
L0 Manual execution at release Spreadsheet tracking
L1 Scheduled periodic tests Cron jobs, CI scheduled
L2 Automatic at every commit GitHub Actions, GitLab CI
L3 Continuous testing with quality gates MLflow, custom pipelines

6. Evaluation Metrics

Metric Application Calculation
Factuality Factual correctness % correct / total
Relevance Answer fits question Average score (1-5 scale)
Consistency Stability over runs Standard deviation over N runs
Refusal rate Adversarial scenarios % correctly refused
Fairness Difference between groups Max difference in error rate

7. Test Framework Checklist

7. Test Framework Checklist

  • Component tests cover critical functions
  • Integration tests validate end-to-end flow
  • Golden Set is composed according to Evidence Standards
  • Adversarial scenarios are defined and tested
  • Regression test strategy is documented
  • Evaluation metrics are defined
  • Test results are recorded in Validation Report