
Set up KPIs and track RAG quality over time for confident iteration
Define KPIs from business-specific queries and measurable benchmarks, then track RAG quality against them over time. This step lets you iterate with confidence because every change is backed by data.
Entry Point: RAG project → Benchmark tab
Prerequisites: Configured RAG system (Steps 1-5 complete)
Expected Outcome: Measurable KPIs, baseline established, iteration workflow ready
Without benchmarking:
- ❌ Changes are based on intuition
- ❌ Can't measure whether improvements actually help
- ❌ Risk of breaking existing functionality
- ❌ No way to track progress over time
With benchmarking:
- ✅ Every change is measured against KPIs
- ✅ Confidence in improvements
- ✅ Regressions are caught before production
- ✅ Data-driven decision making
Create a list of 20-50 queries representing real user questions:
| Query ID | Query | Category | Priority |
|---|---|---|---|
| Q001 | "What is the return policy?" | Policy | High |
| Q002 | "How do I reset my password?" | Support | High |
| Q003 | "What products integrate with Slack?" | Product | Medium |
| Q004 | "What is the enterprise pricing?" | Sales | High |
| Q005 | "How do I export my data?" | Features | Medium |
Tips for Query Selection:
For each query, document what a good answer should include:
Query Q001: "What is the return policy?"
Expected Answer Should Include:
- Return window (30 days)
- Condition requirements (unused, original packaging)
- Refund timeline (5-7 business days)
- Exception items (software, DVDs)
Accuracy Threshold: 75% (3 of 4 elements)
Query Q002: "How do I reset my password?"
Expected Answer Should Include:
- Password reset link location
- Step-by-step instructions
- Support contact if issues
Accuracy Threshold: 100% (all elements required)
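The expected-answer format above can be made machine-checkable by storing each query with its required elements and a pass threshold. The following is a minimal sketch under stated assumptions: `ExpectedAnswer`, `element_coverage`, `passes`, and the keyword lists are hypothetical names and data invented for illustration, and naive substring matching stands in for the human or LLM-judge evaluation a real setup would likely use.

```python
from dataclasses import dataclass


@dataclass
class ExpectedAnswer:
    """Hypothetical record for one benchmark query and its expected elements."""
    query_id: str
    query: str
    # Maps an element label to keywords; any keyword match counts as covered.
    elements: dict[str, list[str]]
    threshold: float  # fraction of elements that must be present to pass


def element_coverage(answer: str, expected: ExpectedAnswer) -> float:
    """Return the fraction of expected elements found in the answer.

    Uses case-insensitive substring matching, an assumption made for this sketch.
    """
    text = answer.lower()
    hits = sum(
        any(keyword.lower() in text for keyword in keywords)
        for keywords in expected.elements.values()
    )
    return hits / len(expected.elements)


def passes(answer: str, expected: ExpectedAnswer) -> bool:
    """True when coverage meets or exceeds the query's accuracy threshold."""
    return element_coverage(answer, expected) >= expected.threshold


# Q002 from the example above; the keywords are illustrative guesses.
q002 = ExpectedAnswer(
    query_id="Q002",
    query="How do I reset my password?",
    elements={
        "reset link location": ["forgot password", "reset link"],
        "step-by-step instructions": ["step 1", "first,"],
        "support contact": ["contact support", "support@"],
    },
    threshold=1.0,  # all elements required
)

answer = "Click 'Forgot password' on the login page. First, enter your email."
coverage = element_coverage(answer, q002)
print(f"{q002.query_id}: coverage={coverage:.0%}, pass={passes(answer, q002)}")
```

Because Q002 requires every element, an answer that omits the support contact fails even though most of the content is present.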
| KPI | Target | How It's Measured |
|---|---|---|
| Answer Accuracy | > 90% | Human evaluation or LLM judge |
| Avg Similarity Score | > 0.75 | Average of top result scores |
| Response Time | < 2 seconds | API response latency |
| Source Coverage | > 80% | Relevant sources retrieved |
KPI Definitions:
1. Load Query Set
↓
2. Execute Each Query Through RAG Pipeline
↓
3. Collect Results (answers, scores, sources)
↓
4. Compare Against Expected Answers
↓
5. Calculate KPIs
↓
6. Generate Report
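The six steps above can be sketched as a single loop. This is an illustration rather than the product's implementation: `run_benchmark`, the query dictionary keys, the `RagResult` shape, and the stub pipeline and judge are all assumptions, and source coverage is defined here as the share of queries for which at least one relevant source was retrieved.

```python
import time
from statistics import mean
from typing import Callable

# Assumed result shape: (answer text, top similarity score, retrieved source names).
RagResult = tuple[str, float, list[str]]


def run_benchmark(
    queries: list[dict],
    pipeline: Callable[[str], RagResult],
    judge: Callable[[str, list[str]], float],
) -> dict:
    rows = []
    for q in queries:  # Step 1: the loaded query set is passed in.
        start = time.perf_counter()
        answer, score, sources = pipeline(q["query"])  # Step 2: execute the query.
        latency = time.perf_counter() - start
        rows.append({  # Step 3: collect results.
            "id": q["id"],
            "accuracy": judge(answer, q["expected"]),  # Step 4: compare to expected.
            "similarity": score,
            "latency_s": latency,
            # Assumed definition: covered if any relevant source was retrieved.
            "source_hit": float(any(s in sources for s in q["relevant_sources"])),
        })
    kpis = {  # Step 5: aggregate per-query rows into KPIs.
        "answer_accuracy": mean(r["accuracy"] for r in rows),
        "avg_similarity": mean(r["similarity"] for r in rows),
        "avg_response_time_s": mean(r["latency_s"] for r in rows),
        "source_coverage": mean(r["source_hit"] for r in rows),
    }
    return {"kpis": kpis, "rows": rows}  # Step 6: data for the report.


def fake_pipeline(query: str) -> RagResult:
    """Stand-in for the real RAG call; returns a canned result."""
    return ("Returns are accepted within 30 days.", 0.89, ["returns.md"])


def keyword_judge(answer: str, expected: list[str]) -> float:
    """Stand-in judge: fraction of expected phrases present in the answer."""
    found = sum(phrase.lower() in answer.lower() for phrase in expected)
    return found / len(expected)


queries = [{
    "id": "Q001",
    "query": "What is the return policy?",
    "expected": ["30 days"],
    "relevant_sources": ["returns.md"],
}]
report = run_benchmark(queries, fake_pipeline, keyword_judge)
print(report["kpis"])
```

Passing the pipeline and judge in as callables keeps the loop unchanged when the embedding model, retriever, or evaluation method is swapped.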
Benchmark Report - Project: Customer Support RAG
Date: 2024-01-15
Query Set: 50 business queries
─────────────────────────────────────────────────
Overall KPIs:
┌─────────────────────┬────────┬──────────┬─────────┐
│ KPI                 │ Target │ Actual   │ Status  │
├─────────────────────┼────────┼──────────┼─────────┤
│ Answer Accuracy     │ > 90%  │ 92%      │ ✓ Pass  │
│ Avg Similarity      │ > 0.75 │ 0.82     │ ✓ Pass  │
│ Response Time       │ < 2s   │ 1.4s     │ ✓ Pass  │
│ Source Coverage     │ > 80%  │ 85%      │ ✓ Pass  │
└─────────────────────┴────────┴──────────┴─────────┘
Query-Level Details:
┌─────────┬────────────┬───────────┬───────────┬──────────┐
│ Query   │ Category   │ Accuracy  │ Sim Score │ Status   │
├─────────┼────────────┼───────────┼───────────┼──────────┤
│ Q001    │ Policy     │ 95%       │ 0.89      │ ✓ Pass   │
│ Q002    │ Support    │ 88%       │ 0.76      │ ⚠ Review │
│ Q003    │ Product    │ 94%       │ 0.85      │ ✓ Pass   │
│ Q004    │ Sales      │ 91%       │ 0.81      │ ✓ Pass   │
│ Q005    │ Features   │ 89%       │ 0.78      │ ✓ Pass   │
└─────────┴────────────┴───────────┴───────────┴──────────┘
Failed/Weak Queries:
- Q002: Answer missing password reset link instructions
Recommendation: Add password_reset.md to document sources
Current State: Baseline KPIs measured
↓
Proposed Change: "Switch to text-embedding-3-large"
↓
Run Benchmark: Execute same query set
↓
Compare Results:
┌─────────────┬───────────┬───────────┬────────────┐
│ KPI         │ Before    │ After     │ Change     │
├─────────────┼───────────┼───────────┼────────────┤
│ Accuracy    │ 92%       │ 94%       │ +2% ✓      │
│ Sim Score   │ 0.82      │ 0.87      │ +0.05 ✓    │
│ Resp Time   │ 1.4s      │ 1.8s      │ +0.4s ⚠    │
└─────────────┴───────────┴───────────┴────────────┘
↓
Decision: Accuracy improvement is worth the latency increase (1.8s is still under the 2-second target)
↓
Deploy Change
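The before/after comparison in this workflow is simple arithmetic, but the direction of "better" differs per KPI: latency improves when it goes down. A minimal sketch follows, assuming hypothetical names (`compare_runs`, `LOWER_IS_BETTER`, and the KPI keys) and using the figures from the table above.

```python
# KPIs where a lower value is better; all others are treated as higher-is-better.
LOWER_IS_BETTER = {"response_time_s"}


def compare_runs(before: dict[str, float], after: dict[str, float]) -> dict[str, dict]:
    """Return per-KPI delta (after minus before) and whether it improved."""
    comparison = {}
    for name, old in before.items():
        new = after[name]
        delta = round(new - old, 4)  # rounding hides floating-point noise
        if delta == 0:
            status = "unchanged"
        elif (delta < 0) == (name in LOWER_IS_BETTER):
            status = "improved"
        else:
            status = "regressed"
        comparison[name] = {"before": old, "after": new, "delta": delta, "status": status}
    return comparison


before = {"accuracy": 0.92, "sim_score": 0.82, "response_time_s": 1.4}
after = {"accuracy": 0.94, "sim_score": 0.87, "response_time_s": 1.8}

result = compare_runs(before, after)
for name, row in result.items():
    print(f"{name}: {row['before']} -> {row['after']} ({row['delta']:+}) {row['status']}")
```

A regression flag on one KPI does not decide the outcome by itself; as in the example, a latency increase can be acceptable when it stays under target and accuracy improves.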
| Change | Expected KPI Impact | When to Do |
|---|---|---|
| Upgrade embedding model | ↑ Accuracy, ↑ Similarity, ↑ Latency | When accuracy < target |
| Increase Top-K | ↑ Context coverage, ↑ Latency | When answers lack detail |
| Enable BM25 | ↑ Accuracy for technical terms | Technical documentation |
| Add documents | ↑ Source coverage | When queries miss info |
| Adjust chunk size | Variable - test with benchmark | When scores inconsistent |
Deploy When:
Don't Deploy When:
- ✅ Run benchmarks after every significant change
- ✅ Include diverse query types (simple, complex, edge cases)
- ✅ Track benchmarks over time (trend analysis)
- ✅ Set realistic KPI targets based on use case
- ✅ Document benchmark changes (query additions, removals)
- ❌ Don't change benchmark queries frequently (breaks trend comparison)
- ❌ Don't optimize for KPIs at the expense of user experience
- ❌ Don't ignore latency KPIs (accuracy isn't everything)
- ❌ Don't skip the benchmark before production deployment
Development → Benchmark Test → KPI Threshold Check → Deploy
↓
If KPIs below threshold:
- Block deployment
- Investigate regression
- Fix and re-test
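The threshold check in this gate could be scripted so that a CI job blocks the deployment automatically. The sketch below is one possible shape, not a built-in feature: `THRESHOLDS`, `find_violations`, and the candidate KPI values are hypothetical, and the limits reuse the targets from the KPI table.

```python
# Targets from the KPI table. "floor" means the value must exceed the limit,
# "ceiling" means it must stay below it.
THRESHOLDS = {
    "answer_accuracy": ("floor", 0.90),
    "avg_similarity": ("floor", 0.75),
    "response_time_s": ("ceiling", 2.0),
    "source_coverage": ("floor", 0.80),
}


def find_violations(kpis: dict[str, float]) -> list[str]:
    """Return one message per KPI that misses its threshold."""
    violations = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = kpis[name]
        ok = value > limit if kind == "floor" else value < limit
        if not ok:
            violations.append(f"{name}={value} misses {kind} of {limit}")
    return violations


# Hypothetical benchmark result for a release candidate.
candidate = {
    "answer_accuracy": 0.88,
    "avg_similarity": 0.82,
    "response_time_s": 1.4,
    "source_coverage": 0.85,
}

violations = find_violations(candidate)
for message in violations:
    print("BLOCKED:", message)

# A CI script would pass this value to sys.exit() so a non-zero code fails the job.
exit_code = 1 if violations else 0
```

Returning the list of violations, rather than a single boolean, gives the "investigate regression" step a starting point.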
| Frequency | Scope | Purpose |
|---|---|---|
| Daily | Critical queries (10) | Catch major regressions |
| Weekly | Full query set (50+) | Track trends |
| Monthly | Extended set (100+) | Comprehensive analysis |
Track KPIs over time to identify:
Accuracy Trend (Last 4 Weeks):
Week 1: 89%
Week 2: 91% (+2%)
Week 3: 92% (+1%)
Week 4: 94% (+2%)
Trend: Improving ✓
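The week-over-week changes and the trend label shown above can be derived from the stored KPI history. This is a minimal sketch with assumed names (`week_over_week`, `classify_trend`) and an assumed tolerance of half a percentage point below which the trend is considered flat; it compares only the first and last values, which is a simplification.

```python
def week_over_week(values: list[float]) -> list[float]:
    """Difference between each value and the one before it."""
    return [round(curr - prev, 4) for prev, curr in zip(values, values[1:])]


def classify_trend(values: list[float], tolerance: float = 0.005) -> str:
    """Label the overall direction from first to last value.

    The tolerance is an arbitrary choice for this sketch.
    """
    change = values[-1] - values[0]
    if change > tolerance:
        return "improving"
    if change < -tolerance:
        return "declining"
    return "flat"


# Accuracy for the last four weeks, from the example above.
weekly_accuracy = [0.89, 0.91, 0.92, 0.94]

deltas = week_over_week(weekly_accuracy)
trend = classify_trend(weekly_accuracy)
print("Changes:", [f"{d:+.0%}" for d in deltas])
print("Trend:", trend)
```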
Benchmarking is not a one-time activity. Make it part of your regular RAG development workflow:
After establishing benchmarking: