1. Performance Degradation Detection (Drift Detection)¶
Purpose
Methods for detecting, measuring and responding to quality degradation (drift) in AI systems.
When to use this?
You notice your AI system in production is performing differently than expected, or you want to proactively set up monitoring to detect quality degradation early.
1. Objective¶
Performance degradation (drift) is the phenomenon where the quality of an AI system deteriorates over time. This module describes how we detect, measure and respond to drift.
2. Types of Performance Degradation¶
Data Drift¶
What: The input the system receives changes relative to the data on which it was trained/tested.
Examples:
- New product categories not present in the knowledge base
- Changed language use by customers
- Seasonal demand patterns
Signals:
- Increase in "I don't know" answers
- Queries about unknown topics
- Changing query distribution
Concept Drift¶
What: The relationship between input and desired output changes, even if the input remains similar.
Examples:
- Price changes not updated in the knowledge base
- New policy requiring different answers
- Changing customer expectations
Signals:
- Correct answers are assessed as incorrect
- Increase in complaints despite unchanged test results
- Gap between validation and production feedback
Performance Degradation¶
What: The model itself changes (through provider updates) or degrades.
Examples:
- Provider update to a new model
- Changes in API behaviour
- Fine-tuned model loses quality
Signals:
- Sudden change in output style
- Changed latency or token usage
- Regression on previously working scenarios
Assumption Drift¶
What: The assumptions on which the AI system was built no longer hold due to changes in the environment, usage patterns or regulations.
Examples:
- User volume grows beyond assumed capacity
- Data distribution shifts compared to the original assumption
- New regulations (e.g. EU AI Act enforcement) make the current approach non-compliant
- Costs scale differently than assumed
Signals:
- Discrepancy between assumed and actual user profile
- Cost overruns without changes in functionality
- Compliance findings during audits
Action: Re-assess the assumptions in the Objective Card (section E) at every quarterly review or after significant changes in the operational landscape.
3. Detection Methods¶
Periodic Golden Set Testing¶
Approach: Run the Golden Set regularly in production.
| Risk Level | Frequency | Scope |
|---|---|---|
| Minimal | Monthly | Sample (25%) |
| Limited | Weekly | Full set |
| High | Daily/Continuous | Full set + additional |
What we measure:
- Factual accuracy (% correct)
- Relevance (average score)
- Refusal rate (adversarial)
- Comparison with baseline
Real-time Monitoring¶
Approach: Monitor production interactions for signals of drift.
Metrics to monitor:
| Metric | Threshold for alert |
|---|---|
| Error rate | > 1.5x baseline |
| "Don't know" answers | > 2x baseline |
| Latency | > 2x baseline |
| Token usage | > 1.5x baseline (cost indicator) |
| Negative feedback | > 2x baseline |
User Feedback Analysis¶
Approach: Collect and analyse feedback systematically.
Feedback channels:
- Thumbs up/down in interface
- Escalations to human staff
- Complaints via other channels
- Corrections by users
4. Thresholds¶
Based on Evidence Standards section 3.2:
Significant performance degradation occurs when:
| Criterion | Threshold |
|---|---|
| Factual accuracy | Drops ≥ 2 percentage points vs baseline |
| Relevance (1–5) | Drops ≥ 0.3 vs baseline |
| Major errors | Increases ≥ 50% over 2 measurement periods |
| Critical errors | > 0 = immediate action |
Alert levels:
| Level | Condition | Action |
|---|---|---|
| Green | Within baseline | Normal management |
| Yellow | Between baseline and threshold | Increased monitoring |
| Orange | Threshold exceeded | Investigation + mitigation plan |
| Red | Critical error or severe degradation | Escalation + possible rollback |
5. Response Protocol¶
On Yellow (Increased Monitoring)¶
- Increase measurement frequency
- Analyse trend (is it stable or worsening?)
- Identify possible causes
- Document findings
On Orange (Investigation)¶
- Perform root cause analysis
- Determine type of drift (data/concept/model)
- Draft mitigation plan
- Inform stakeholders
- Plan corrective action
On Red (Escalation)¶
- Escalate to Tech Lead and Guardian
- Consider rollback or temporary shutdown
- Activate incident process
- Communicate to users if relevant
- Document for lessons learned
6. Mitigation Strategies¶
Data Drift¶
| Cause | Mitigation |
|---|---|
| Knowledge base outdated | Update knowledge base, reindex |
| New topics | Extend knowledge base |
| Changed language use | Adjust prompts, update examples |
Concept Drift¶
| Cause | Mitigation |
|---|---|
| Policy changed | Update Steering Instructions |
| Expectations changed | Revise Goal Definition, update spec |
| External changes | Revise Hard Boundaries |
Performance Degradation¶
| Cause | Mitigation |
|---|---|
| Provider update | Regression test, adjust prompts |
| API changes | Update integration, provide fallback |
| Unexplained degradation | Contact provider, consider rollback |
7. Baseline Measurement¶
Recording the Baseline¶
At go-live, record the baseline:
| Metric | Value at go-live | Alert threshold |
|---|---|---|
| Factual acc. | 99.2% | \< 97.2% |
| Relevance | 4.4 | \< 4.1 |
| Major errors | 2/150 | > 3/150 |
| Latency (p95) (95th percentile — 95% of all requests are faster than this value) | 1.8s | > 3.6s |
Updating the Baseline¶
- After significant system changes
- After knowledge base expansion
- Minimum annual review
8. Monitoring Dashboard¶
Recommended visualisations:
| Visualisation | Purpose |
|---|---|
| Trend line metrics | Factual accuracy, relevance over time |
| Heatmap query categories | Identify problematic areas |
| Alert timeline | Overview of threshold breaches |
| Comparison with baseline | Current vs baseline |
9. Performance Degradation Monitoring Checklist¶
9. Performance Degradation Monitoring Checklist
- Baseline is recorded at go-live
- Periodic Golden Set testing is scheduled
- Real-time monitoring is active
- Thresholds are configured
- Alerting is linked to responsible parties
- Response protocol is documented and known
- Feedback channels are set up
10. Related Modules¶
- Evidence Standards
- Monitoring & Optimisation
- Incident Response
- Metrics Dashboards
- Agentic AI Engineering — Silent Degradation
- Pitfalls Catalogue
Next step: Set up the monitoring dashboard and define thresholds for your production environment → See also: Metrics & Dashboards