1. Performance Degradation Detection (Drift Detection)

Purpose

Methods for detecting, measuring and responding to quality degradation (drift) in AI systems.

When to use this?

Use this module when your AI system in production performs differently than expected, or when you want to set up monitoring proactively to detect quality degradation early.

1. Objective

Performance degradation (drift) is the phenomenon where the quality of an AI system deteriorates over time. This module describes how we detect, measure and respond to drift.

2. Types of Performance Degradation

Data Drift

What: The input the system receives changes relative to the data on which it was trained/tested.

Examples:

New product categories not present in the knowledge base
Changed language use by customers
Seasonal demand patterns

Signals:

Increase in "I don't know" answers
Queries about unknown topics
Changing query distribution
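One way to put a number on a changing query distribution is to compare the share of queries per category in a reference window with the share in the current window. The sketch below uses Jensen-Shannon divergence, a technique chosen here for illustration and not prescribed by this module. It assumes every query has already been assigned a category; the category names and counts are hypothetical.

```python
import math
from collections import Counter


def category_distribution(categories, vocabulary):
    """Return the relative frequency of each category listed in `vocabulary`."""
    counts = Counter(categories)
    total = sum(counts[c] for c in vocabulary)
    if total == 0:
        raise ValueError("no queries to compare")
    return [counts[c] / total for c in vocabulary]


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence: 0 for identical, 1 for disjoint distributions."""
    def kl(a, b):
        # Terms with a probability of zero contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical query categories at go-live and in the current period.
baseline_queries = ["billing"] * 50 + ["shipping"] * 40 + ["returns"] * 10
current_queries = ["billing"] * 30 + ["shipping"] * 30 + ["returns"] * 10 + ["unknown"] * 30

vocabulary = sorted(set(baseline_queries) | set(current_queries))
shift = jensen_shannon_divergence(
    category_distribution(baseline_queries, vocabulary),
    category_distribution(current_queries, vocabulary),
)
```

A value near 0 suggests the query mix is unchanged. What counts as a meaningful shift depends on the system, so any cut-off would need to be calibrated against historical periods.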

Concept Drift

What: The relationship between input and desired output changes, even if the input remains similar.

Examples:

Price changes not updated in the knowledge base
New policy requiring different answers
Changing customer expectations

Signals:

Answers that are correct according to the test set are assessed as incorrect in production
Increase in complaints despite unchanged test results
Gap between validation and production feedback

Performance Degradation

What: The model itself changes (through provider updates) or degrades.

Examples:

Provider update to a new model
Changes in API behaviour
Fine-tuned model loses quality

Signals:

Sudden change in output style
Changed latency or token usage
Regression on previously working scenarios
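Regression on previously working scenarios can be checked by comparing the pass/fail outcome per scenario between two test runs. The sketch below is a minimal illustration; the scenario IDs and the dictionary format of a run are assumptions, not part of this module.

```python
def find_regressions(previous_run, current_run):
    """Return IDs of scenarios that passed before and now fail or are missing."""
    return sorted(
        scenario_id
        for scenario_id, passed in previous_run.items()
        if passed and not current_run.get(scenario_id, False)
    )


# Hypothetical results: scenario ID -> True if the scenario passed.
previous_run = {"refund-policy": True, "opening-hours": True, "price-lookup": False}
current_run = {"refund-policy": True, "opening-hours": False, "price-lookup": False}

regressed = find_regressions(previous_run, current_run)
```

Scenarios that were already failing are deliberately left out, so the result lists only behaviour that changed between the two runs.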

Assumption Drift

What: The assumptions on which the AI system was built no longer hold due to changes in the environment, usage patterns or regulations.

Examples:

User volume grows beyond assumed capacity
Data distribution shifts compared to the original assumption
New regulations (e.g. EU AI Act enforcement) make the current approach non-compliant
Costs scale differently than assumed

Signals:

Discrepancy between assumed and actual user profile
Cost overruns without changes in functionality
Compliance findings during audits

Action: Re-assess the assumptions in the Objective Card (section E) at every quarterly review or after significant changes in the operational landscape.

3. Detection Methods

Periodic Golden Set Testing

Approach: Run the Golden Set regularly in production.

Risk Level	Frequency	Scope
Minimal	Monthly	Sample (25%)
Limited	Weekly	Full set
High	Daily/Continuous	Full set + additional

What we measure:

Factual accuracy (% correct)
Relevance (average score)
Refusal rate on adversarial prompts
Comparison with baseline
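The four measurements above could be aggregated from a Golden Set run as in the sketch below. The `GoldenResult` record and its field names are hypothetical. The sketch also assumes that accuracy and relevance are scored on regular cases only, and that the refusal rate is scored on adversarial cases only.

```python
from dataclasses import dataclass


@dataclass
class GoldenResult:
    correct: bool              # factual accuracy judgement for this case
    relevance: float           # relevance score on the 1-5 scale
    adversarial: bool = False  # True if the case is an adversarial prompt
    refused: bool = False      # True if the system refused to answer


def summarise(results):
    """Aggregate one Golden Set run into the metrics that are tracked over time."""
    regular = [r for r in results if not r.adversarial]
    adversarial = [r for r in results if r.adversarial]
    summary = {}
    if regular:
        summary["factual_accuracy_pct"] = 100 * sum(r.correct for r in regular) / len(regular)
        summary["relevance_avg"] = sum(r.relevance for r in regular) / len(regular)
    if adversarial:
        summary["refusal_rate_pct"] = 100 * sum(r.refused for r in adversarial) / len(adversarial)
    return summary


def delta_vs_baseline(summary, baseline):
    """Return current minus baseline for every metric present in both."""
    return {name: summary[name] - baseline[name] for name in summary if name in baseline}


# Hypothetical run: four regular cases and two adversarial cases.
results = [
    GoldenResult(True, 4.0),
    GoldenResult(True, 4.0),
    GoldenResult(True, 5.0),
    GoldenResult(False, 3.0),
    GoldenResult(False, 1.0, adversarial=True, refused=True),
    GoldenResult(False, 1.0, adversarial=True, refused=False),
]
summary = summarise(results)
delta = delta_vs_baseline(summary, {"factual_accuracy_pct": 80.0, "relevance_avg": 4.5})
```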

Real-time Monitoring

Approach: Monitor production interactions for signals of drift.

Metrics to monitor:

Metric	Threshold for alert
Error rate	> 1.5x baseline
"Don't know" answers	> 2x baseline
Latency	> 2x baseline
Token usage	> 1.5x baseline (cost indicator)
Negative feedback	> 2x baseline
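The alert thresholds in this table are multiples of the baseline, so a check can be expressed as a simple ratio comparison. The sketch below is one possible shape; the metric keys and the example values are hypothetical, and the multipliers are copied from the table.

```python
# Alert multipliers taken from the table above; the key names are illustrative.
ALERT_MULTIPLIERS = {
    "error_rate": 1.5,
    "dont_know_rate": 2.0,
    "latency_s": 2.0,
    "tokens_per_request": 1.5,
    "negative_feedback_rate": 2.0,
}


def breached_metrics(current, baseline, multipliers=ALERT_MULTIPLIERS):
    """Return the metrics whose current value exceeds multiplier x baseline."""
    breaches = []
    for name, factor in multipliers.items():
        if name not in current or name not in baseline:
            continue  # skip metrics that are not measured in both periods
        if current[name] > factor * baseline[name]:
            breaches.append(name)
    return breaches


# Hypothetical baseline and current values.
baseline = {"error_rate": 0.02, "latency_s": 1.8, "tokens_per_request": 400}
current = {"error_rate": 0.031, "latency_s": 4.0, "tokens_per_request": 500}

alerts = breached_metrics(current, baseline)
```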

User Feedback Analysis

Approach: Collect and analyse feedback systematically.

Feedback channels:

Thumbs up/down in interface
Escalations to human staff
Complaints via other channels
Corrections by users
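Feedback from these channels can be combined into a single negative feedback rate, so that it can be compared with the baseline. The sketch below is a minimal illustration; the event names and the choice to count every escalation, complaint and correction as negative are assumptions.

```python
from collections import Counter

# Hypothetical event names, one per feedback channel listed above.
NEGATIVE_EVENTS = {"thumbs_down", "escalation", "complaint", "user_correction"}


def feedback_summary(events, total_interactions):
    """Count feedback events per type and derive the negative feedback rate."""
    if total_interactions <= 0:
        raise ValueError("total_interactions must be positive")
    counts = Counter(events)
    negative = sum(counts[name] for name in NEGATIVE_EVENTS)
    return {
        "counts": dict(counts),
        "negative_feedback_rate": negative / total_interactions,
    }


# Hypothetical feedback collected over 200 interactions.
events = ["thumbs_up"] * 30 + ["thumbs_down"] * 6 + ["escalation"] * 3 + ["complaint"] * 1
report = feedback_summary(events, total_interactions=200)
```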

4. Thresholds

Based on Evidence Standards section 3.2:

Significant performance degradation occurs when:

Criterion	Threshold
Factual accuracy	Drops ≥ 2 percentage points vs baseline
Relevance (1–5)	Drops ≥ 0.3 vs baseline
Major errors	Increases ≥ 50% over 2 measurement periods
Critical errors	> 0 = immediate action
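The four criteria can be evaluated mechanically against the baseline, as in the sketch below. The `Measurement` record is hypothetical. The major-error rule is read here as "at least 50% above baseline in each of the last two measurement periods", which is an interpretation of the table and not a confirmed definition.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    factual_accuracy_pct: float
    relevance: float       # average score on the 1-5 scale
    major_errors: int
    critical_errors: int


def breached_criteria(baseline, previous, current):
    """Return the names of the criteria that indicate significant degradation."""
    breached = []
    # Differences are rounded to sidestep floating-point noise at the boundary.
    if round(baseline.factual_accuracy_pct - current.factual_accuracy_pct, 6) >= 2.0:
        breached.append("factual_accuracy")
    if round(baseline.relevance - current.relevance, 6) >= 0.3:
        breached.append("relevance")
    # Assumed reading: both of the last two periods are >= 50% above baseline.
    if all(
        m.major_errors > baseline.major_errors
        and m.major_errors >= 1.5 * baseline.major_errors
        for m in (previous, current)
    ):
        breached.append("major_errors")
    if current.critical_errors > 0:
        breached.append("critical_errors")
    return breached


# Hypothetical measurements for the baseline and the last two periods.
baseline = Measurement(99.2, 4.4, 2, 0)
previous = Measurement(98.5, 4.3, 3, 0)
current = Measurement(97.0, 4.2, 3, 0)

findings = breached_criteria(baseline, previous, current)
```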

Alert levels:

Level	Condition	Action
Green	Within baseline	Normal management
Yellow	Between baseline and threshold	Increased monitoring
Orange	Threshold exceeded	Investigation + mitigation plan
Red	Critical error or severe degradation	Escalation + possible rollback
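The mapping from conditions to alert levels can be sketched as a small decision function. The boolean inputs are hypothetical. Because the table does not define "severe degradation", the caller passes it in as an explicit flag.

```python
from enum import Enum


class AlertLevel(Enum):
    # Each value is the action from the table above.
    GREEN = "Normal management"
    YELLOW = "Increased monitoring"
    ORANGE = "Investigation + mitigation plan"
    RED = "Escalation + possible rollback"


def alert_level(worse_than_baseline, threshold_exceeded,
                critical_error=False, severe_degradation=False):
    """Pick the highest alert level whose condition holds."""
    if critical_error or severe_degradation:
        return AlertLevel.RED
    if threshold_exceeded:
        return AlertLevel.ORANGE
    if worse_than_baseline:
        return AlertLevel.YELLOW
    return AlertLevel.GREEN


level = alert_level(worse_than_baseline=True, threshold_exceeded=False)
action = level.value
```

Checking the most severe condition first ensures that a critical error always results in Red, even when no other threshold is exceeded.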

5. Response Protocol

On Yellow (Increased Monitoring)

Increase measurement frequency
Analyse trend (is it stable or worsening?)
Identify possible causes
Document findings
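Whether a trend is stable or worsening can be estimated from the slope of a least-squares line through recent measurements. The sketch below assumes a metric where higher is better, such as factual accuracy. The tolerance of 0.05 per measurement period is a hypothetical value that would need tuning.

```python
def slope(values):
    """Least-squares slope of `values` against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def classify_trend(values, tolerance=0.05):
    """Classify a higher-is-better metric as worsening, improving or stable."""
    s = slope(values)
    if s < -tolerance:
        return "worsening"
    if s > tolerance:
        return "improving"
    return "stable"


# Hypothetical factual accuracy (%) over four consecutive measurement periods.
accuracy_trend = classify_trend([99.2, 98.9, 98.5, 98.0])
```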

On Orange (Investigation)

On Red (Escalation)

Escalate to Tech Lead and Guardian
Consider rollback or temporary shutdown
Activate incident process
Communicate to users if relevant
Document for lessons learned

6. Mitigation Strategies

Data Drift

Cause	Mitigation
Knowledge base outdated	Update knowledge base, reindex
New topics	Extend knowledge base
Changed language use	Adjust prompts, update examples

Concept Drift

Cause	Mitigation
Policy changed	Update Steering Instructions
Expectations changed	Revise Goal Definition, update spec
External changes	Revise Hard Boundaries

Performance Degradation

Cause	Mitigation
Provider update	Regression test, adjust prompts
API changes	Update integration, provide fallback
Unexplained degradation	Contact provider, consider rollback

7. Baseline Measurement

Recording the Baseline

At go-live, record the baseline:

Metric	Value at go-live	Alert threshold
Factual accuracy	99.2%	< 97.2%
Relevance	4.4	< 4.1
Major errors	2/150	> 3/150
Latency (p95)	1.8s	> 3.6s

p95 is the 95th percentile: 95% of all requests are faster than this value.
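The alert thresholds in the table follow from the go-live values and the rules used elsewhere in this module: a drop of 2 percentage points in accuracy, a drop of 0.3 in relevance, a 50% increase in major errors, and 2x latency. The sketch below derives them from a recorded baseline. The `Baseline` record and its field names are hypothetical, and major errors are expressed as a count per 150 test cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    factual_accuracy_pct: float
    relevance: float          # average score on the 1-5 scale
    major_errors: int         # count per Golden Set run (150 cases in the table)
    latency_p95_s: float


def alert_thresholds(b):
    """Derive alert thresholds from the baseline using the module's rules."""
    return {
        "factual_accuracy_pct_below": round(b.factual_accuracy_pct - 2.0, 2),
        "relevance_below": round(b.relevance - 0.3, 2),
        "major_errors_above": round(b.major_errors * 1.5, 2),
        "latency_p95_s_above": round(b.latency_p95_s * 2.0, 2),
    }


go_live = Baseline(factual_accuracy_pct=99.2, relevance=4.4, major_errors=2, latency_p95_s=1.8)
thresholds = alert_thresholds(go_live)
```

Storing the baseline as an immutable record makes it harder to overwrite go-live values by accident; an updated baseline would be a new record.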

Updating the Baseline

Update the baseline:

After significant system changes
After knowledge base expansion
At least once a year
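These three triggers can be combined into a single check that runs alongside the periodic tests. The sketch below is illustrative; the function name and flags are hypothetical, and "annual" is approximated as 365 days.

```python
from datetime import date, timedelta


def baseline_update_due(last_update, today, system_changed=False,
                        knowledge_base_expanded=False,
                        max_age=timedelta(days=365)):
    """Return True if any trigger for updating the baseline applies."""
    too_old = (today - last_update) >= max_age
    return system_changed or knowledge_base_expanded or too_old


# Hypothetical dates: the baseline was last updated on 1 January 2024.
due_mid_year = baseline_update_due(date(2024, 1, 1), date(2024, 6, 1))
due_next_year = baseline_update_due(date(2024, 1, 1), date(2025, 1, 1))
```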

8. Monitoring Dashboard

Recommended visualisations:

Visualisation	Purpose
Trend line metrics	Factual accuracy, relevance over time
Heatmap query categories	Identify problematic areas
Alert timeline	Overview of threshold breaches
Comparison with baseline	Current vs baseline

9. Performance Degradation Monitoring Checklist

Baseline is recorded at go-live
Periodic Golden Set testing is scheduled
Real-time monitoring is active
Thresholds are configured
Alerting is linked to responsible parties
Response protocol is documented and known
Feedback channels are set up

Next step: Set up the monitoring dashboard and define thresholds for your production environment → See also: Metrics & Dashboards
