Step 6 - Benchmarking & Iteration

Purpose

Set up KPIs and track RAG quality over time using business-specific queries and measurable benchmarks. This step enables confident iteration and continuous improvement.

Entry Point: RAG project → Benchmark tab

Prerequisites: Configured RAG system (Steps 1-5 complete)

Expected Outcome: Measurable KPIs, baseline established, iteration workflow ready

Why Benchmarking Matters

Without Benchmarking

❌ Changes are based on intuition
❌ Can't measure if improvements actually help
❌ Risk of breaking existing functionality
❌ No way to track progress over time

With Benchmarking

✅ Every change is measured against KPIs
✅ Confidence in improvements
✅ Catch regressions before production
✅ Data-driven decision making

Setting Up Your Benchmark

Step 1: Define Business Queries

Create a list of 20-50 queries representing real user questions:

Query ID	Query	Category	Priority
Q001	"What is the return policy?"	Policy	High
Q002	"How do I reset my password?"	Support	High
Q003	"What products integrate with Slack?"	Product	Medium
Q004	"What is the enterprise pricing?"	Sales	High
Q005	"How do I export my data?"	Features	Medium

Tips for Query Selection:

Include queries from each category (Policy, Support, Product, etc.)
Prioritize high-frequency user questions
Include edge cases and complex queries
Use actual user queries from logs if available
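
To keep the query set versionable and easy to filter, it can be stored as structured data rather than as a loose list. The sketch below is one possible layout; the class name `BenchmarkQuery` and its field names are illustrative assumptions, not part of the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkQuery:
    """One row of the business query table (field names are hypothetical)."""
    query_id: str
    text: str
    category: str
    priority: str  # "High", "Medium", or "Low"


QUERY_SET = [
    BenchmarkQuery("Q001", "What is the return policy?", "Policy", "High"),
    BenchmarkQuery("Q002", "How do I reset my password?", "Support", "High"),
    BenchmarkQuery("Q003", "What products integrate with Slack?", "Product", "Medium"),
    BenchmarkQuery("Q004", "What is the enterprise pricing?", "Sales", "High"),
    BenchmarkQuery("Q005", "How do I export my data?", "Features", "Medium"),
]


def categories_covered(queries):
    """Return the sorted list of categories present in the query set."""
    return sorted({q.category for q in queries})


def high_priority(queries):
    """Return only the queries marked High priority."""
    return [q for q in queries if q.priority == "High"]
```

Checking `categories_covered(QUERY_SET)` before each run is a quick way to confirm that no category has been dropped from the benchmark.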

Step 2: Define Expected Answers

For each query, document what a good answer should include:

Query Q001: "What is the return policy?"

Expected Answer Should Include:
- Return window (30 days)
- Condition requirements (unused, original packaging)
- Refund timeline (5-7 business days)
- Exception items (software, DVDs)

Accuracy Threshold: 75% (3 of 4 elements)

Query Q002: "How do I reset my password?"

Expected Answer Should Include:
- Password reset link location
- Step-by-step instructions
- Support contact if issues

Accuracy Threshold: 100% (all elements required)
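
As a rough illustration of how an accuracy threshold can be applied, the sketch below scores an answer by the fraction of expected elements it mentions. Plain keyword matching is used here only as a stand-in for human evaluation or an LLM judge, and the keywords chosen for Q002 are assumptions made for the example.

```python
def element_accuracy(answer, expected_keywords):
    """Fraction of expected elements whose keyword appears in the answer.

    Keyword matching is a simplification; a human reviewer or LLM judge
    would decide whether each element is genuinely covered.
    """
    text = answer.lower()
    hits = [kw for kw in expected_keywords if kw.lower() in text]
    return len(hits) / len(expected_keywords)


def meets_threshold(answer, expected_keywords, threshold):
    """True when the answer covers at least `threshold` of the elements."""
    return element_accuracy(answer, expected_keywords) >= threshold


# Hypothetical keywords standing in for the three Q002 elements.
Q002_KEYWORDS = ["reset link", "step", "support"]
sample_answer = (
    "Click the reset link on the login page, "
    "then follow each step in the email."
)

score = element_accuracy(sample_answer, Q002_KEYWORDS)
passed = meets_threshold(sample_answer, Q002_KEYWORDS, threshold=1.0)
```

The sample answer never mentions contacting support, so it covers two of the three elements and falls short of the 100% threshold set for Q002.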

Step 3: Set KPI Targets

KPI	Target	How It's Measured
Answer Accuracy	> 90%	Human evaluation or LLM judge
Avg Similarity Score	> 0.75	Average of top result scores
Response Time	< 2 seconds	API response latency
Source Coverage	> 80%	Relevant sources retrieved

KPI Definitions:

Answer Accuracy: Percentage of expected elements present in answer
Avg Similarity Score: Average similarity score of retrieved chunks
Response Time: Time from query to complete response
Source Coverage: Percentage of relevant documents retrieved
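
The definitions above can be turned into a small aggregation routine. This is a minimal sketch under stated assumptions: the per-query field names (`accuracy`, `top_score`, `latency_s`, `retrieved_relevant`, `total_relevant`) are hypothetical, similarity is averaged over each query's top result, and source coverage is averaged per query rather than pooled across the whole set.

```python
from statistics import mean


def compute_kpis(results):
    """Aggregate per-query results into the four KPIs.

    Each result is a dict with hypothetical keys: accuracy (0-1),
    top_score (0-1), latency_s (seconds), retrieved_relevant and
    total_relevant (document counts).
    """
    return {
        "answer_accuracy": mean(r["accuracy"] for r in results),
        "avg_similarity": mean(r["top_score"] for r in results),
        "response_time_s": mean(r["latency_s"] for r in results),
        "source_coverage": mean(
            r["retrieved_relevant"] / r["total_relevant"] for r in results
        ),
    }


# Two made-up results, only to show the shape of the input.
sample_results = [
    {"accuracy": 1.0, "top_score": 0.9, "latency_s": 1.0,
     "retrieved_relevant": 2, "total_relevant": 2},
    {"accuracy": 0.5, "top_score": 0.7, "latency_s": 2.0,
     "retrieved_relevant": 1, "total_relevant": 2},
]

kpis = compute_kpis(sample_results)
```

If response time should reflect worst-case behaviour rather than the average, a percentile such as p95 may be a better aggregate than the mean.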

Running Benchmarks

Benchmark Execution Flow

1. Load Query Set
   ↓
2. Execute Each Query Through RAG Pipeline
   ↓
3. Collect Results (answers, scores, sources)
   ↓
4. Compare Against Expected Answers
   ↓
5. Calculate KPIs
   ↓
6. Generate Report
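
The flow above could be sketched as a short loop. The `pipeline` and `scorer` callables are placeholders for your own RAG call and evaluation method, and the response keys (`answer`, `scores`, `sources`) are assumptions about what such a call returns; stubs are used so the example runs on its own.

```python
import time


def run_benchmark(queries, pipeline, scorer):
    """Steps 1-4: run each (query_id, text) pair and score the answer."""
    results = []
    for query_id, text in queries:
        start = time.perf_counter()
        response = pipeline(text)  # hypothetical RAG pipeline call
        latency_s = time.perf_counter() - start
        results.append({
            "query_id": query_id,
            "answer": response["answer"],
            "top_score": max(response["scores"]),
            "sources": response["sources"],
            "latency_s": latency_s,
            "accuracy": scorer(query_id, response["answer"]),
        })
    return results


def format_report(results):
    """Steps 5-6: aggregate accuracy and render a plain-text report."""
    mean_accuracy = sum(r["accuracy"] for r in results) / len(results)
    lines = [
        f"Queries: {len(results)}",
        f"Answer accuracy: {mean_accuracy:.0%}",
    ]
    lines += [
        f"{r['query_id']}: accuracy={r['accuracy']:.0%} sim={r['top_score']:.2f}"
        for r in results
    ]
    return "\n".join(lines)


# Stand-ins so the sketch runs without a real RAG system.
def stub_pipeline(text):
    return {
        "answer": f"Stub answer for: {text}",
        "scores": [0.80, 0.60],
        "sources": ["doc.md"],
    }


def stub_scorer(query_id, answer):
    return 1.0 if answer.startswith("Stub answer") else 0.0


results = run_benchmark(
    [("Q001", "What is the return policy?")], stub_pipeline, stub_scorer
)
report = format_report(results)
```

Passing the pipeline and scorer in as arguments keeps the loop unchanged when the embedding model, retriever, or evaluation method is swapped.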

Benchmark Report Example

Benchmark Report - Project: Customer Support RAG
Date: 2024-01-15
Query Set: 50 business queries
─────────────────────────────────────────────────

Overall KPIs:
┌─────────────────────┬────────┬──────────┬─────────┐
│ KPI                 │ Target │ Actual   │ Status  │
├─────────────────────┼────────┼──────────┼─────────┤
│ Answer Accuracy     │ > 90%  │ 92%      │ ✓ Pass  │
│ Avg Similarity      │ > 0.75 │ 0.82     │ ✓ Pass  │
│ Response Time       │ < 2s   │ 1.4s     │ ✓ Pass  │
│ Source Coverage     │ > 80%  │ 85%      │ ✓ Pass  │
└─────────────────────┴────────┴──────────┴─────────┘

Query-Level Details:
┌─────────┬────────────┬───────────┬───────────┬──────────┐
│ Query   │ Category   │ Accuracy  │ Sim Score │ Status   │
├─────────┼────────────┼───────────┼───────────┼──────────┤
│ Q001    │ Policy     │ 95%       │ 0.89      │ ✓ Pass   │
│ Q002    │ Support    │ 88%       │ 0.76      │ ⚠ Review │
│ Q003    │ Product    │ 94%       │ 0.85      │ ✓ Pass   │
│ Q004    │ Sales      │ 91%       │ 0.81      │ ✓ Pass   │
│ Q005    │ Features   │ 89%       │ 0.78      │ ✓ Pass   │
└─────────┴────────────┴───────────┴───────────┴──────────┘

Failed/Weak Queries:
- Q002: Answer missing password reset link instructions
  Recommendation: Add password_reset.md to document sources

Interpreting Results

Overall KPIs:

Green (✓ Pass): Meeting or exceeding target
Yellow (⚠ Review): Close to target, monitor
Red (✗ Fail): Below target, needs attention

Query-Level Details:

Identify specific queries causing issues
Pattern analysis by category
Prioritize fixes by query priority
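
The three status levels can be assigned mechanically once "close to target" is given a number. The sketch below assumes a 5% relative margin for the Review band; that margin is an illustrative choice, not a value defined by the product.

```python
def kpi_status(actual, target, higher_is_better=True, margin=0.05):
    """Classify a KPI value as Pass, Review, or Fail.

    `margin` is the assumed relative distance from the target that still
    counts as "close" (Review). For lower-is-better KPIs such as latency,
    both values are negated so the same comparisons apply.
    """
    if not higher_is_better:
        actual, target = -actual, -target
    if actual >= target:
        return "Pass"
    if actual >= target - abs(target) * margin:
        return "Review"
    return "Fail"


accuracy_status = kpi_status(0.88, 0.90)                        # just under target
latency_status = kpi_status(2.05, 2.0, higher_is_better=False)  # just over target
```

With these inputs both KPIs land in the Review band: accuracy is within 5% of its 90% target, and latency exceeds its 2 second target by less than 5%.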

Iteration Workflow

Making Confident Changes

Current State: Baseline KPIs measured
   ↓
Proposed Change: "Switch to text-embedding-3-large"
   ↓
Run Benchmark: Execute same query set
   ↓
Compare Results:
┌─────────────┬───────────┬───────────┬────────────┐
│ KPI         │ Before    │ After     │ Change     │
├─────────────┼───────────┼───────────┼────────────┤
│ Accuracy    │ 92%       │ 94%       │ +2% ✓      │
│ Sim Score   │ 0.82      │ 0.87      │ +0.05 ✓    │
│ Resp Time   │ 1.4s      │ 1.8s      │ +0.4s ⚠    │
└─────────────┴───────────┴───────────┴────────────┘
   ↓
Decision: Accuracy improvement worth slight latency increase
   ↓
Deploy Change
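
The before/after comparison is simple arithmetic, but computing it in code avoids sign mistakes, particularly for latency, where an increase is a regression. This sketch uses the figures from the example; the dictionary keys are hypothetical names.

```python
BASELINE = {"accuracy": 0.92, "sim_score": 0.82, "resp_time_s": 1.4}
CANDIDATE = {"accuracy": 0.94, "sim_score": 0.87, "resp_time_s": 1.8}

# KPIs where a smaller value is the better outcome.
LOWER_IS_BETTER = {"resp_time_s"}


def compare(before, after):
    """Return the change per KPI and whether it counts as an improvement."""
    rows = {}
    for name in before:
        delta = round(after[name] - before[name], 4)  # trim float noise
        improved = delta < 0 if name in LOWER_IS_BETTER else delta > 0
        rows[name] = {"delta": delta, "improved": improved}
    return rows


comparison = compare(BASELINE, CANDIDATE)
```

Here accuracy rises by 0.02 and similarity by 0.05, while response time rises by 0.4 seconds and is flagged as not improved.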

Common Iterations

Change	Expected KPI Impact	When to Do
Upgrade embedding model	↑ Accuracy, ↑ Similarity, ↑ Latency	When accuracy < target
Increase Top-K	↑ Context coverage, ↑ Latency	When answers lack detail
Enable BM25	↑ Accuracy for technical terms	Technical documentation
Add documents	↑ Source coverage	When queries miss info
Adjust chunk size	Variable - test with benchmark	When scores inconsistent

Iteration Decision Framework

Deploy When:

Primary KPIs improve (accuracy, similarity)
Secondary KPI impact acceptable (latency)
No regressions in critical queries

Don't Deploy When:

Primary KPIs decrease
Latency increase unacceptable
Critical queries regress
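
The deploy criteria can be encoded as a function that lists the reasons a change should be held back. This is a sketch only: the 0.5 second latency budget and the key names are assumptions, and what counts as an acceptable latency increase depends on your use case.

```python
def blocking_reasons(deltas, regressed_critical_queries,
                     max_latency_increase_s=0.5):
    """Return the reasons not to deploy; an empty list means deploy.

    `deltas` maps hypothetical KPI names to (after - before) values.
    `max_latency_increase_s` is an assumed budget, not a product default.
    """
    reasons = []
    primary = [deltas["accuracy"], deltas["sim_score"]]
    if any(d < 0 for d in primary):
        reasons.append("primary KPI decreased")
    elif not any(d > 0 for d in primary):
        reasons.append("no primary KPI improved")
    if deltas["resp_time_s"] > max_latency_increase_s:
        reasons.append("latency increase unacceptable")
    if regressed_critical_queries:
        reasons.append(
            "critical queries regressed: " + ", ".join(regressed_critical_queries)
        )
    return reasons


ok_change = blocking_reasons(
    {"accuracy": 0.02, "sim_score": 0.05, "resp_time_s": 0.4}, []
)
bad_change = blocking_reasons(
    {"accuracy": -0.01, "sim_score": 0.05, "resp_time_s": 0.9}, ["Q002"]
)
```

Returning reasons rather than a bare yes/no makes the decision easy to record alongside the benchmark results.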

Benchmarking Best Practices

Do's

✅ Run benchmarks after every significant change
✅ Include diverse query types (simple, complex, edge cases)
✅ Track benchmarks over time (trend analysis)
✅ Set realistic KPI targets based on use case
✅ Document benchmark changes (query additions, removals)

Don'ts

❌ Don't change benchmark queries frequently (breaks trend comparison)
❌ Don't optimize for KPIs at the expense of user experience
❌ Don't ignore latency KPIs (accuracy isn't everything)
❌ Don't skip the benchmark before production deployment

Integration with Development Workflow

Before Production Deployment

Development → Benchmark Test → KPI Threshold Check → Deploy
                                    ↓
                            If KPIs below threshold:
                            - Block deployment
                            - Investigate regression
                            - Fix and re-test
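
In an automated pipeline, the threshold check typically becomes a script whose exit code decides whether deployment proceeds. The sketch below assumes the KPI targets from this guide and hypothetical KPI names; how the KPI values reach the script (file, API, environment) is left open.

```python
# Limits taken from the KPI targets in this guide; key names are hypothetical.
THRESHOLDS = {
    "answer_accuracy": ("min", 0.90),
    "avg_similarity": ("min", 0.75),
    "response_time_s": ("max", 2.0),
    "source_coverage": ("min", 0.80),
}


def gate(kpis, thresholds=THRESHOLDS):
    """Return (exit_code, failures); exit code 0 allows deployment."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = kpis[name]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(f"{name}={value} violates {kind} {limit}")
    return (1 if failures else 0), failures


exit_code, failures = gate({
    "answer_accuracy": 0.92,
    "avg_similarity": 0.82,
    "response_time_s": 2.3,   # too slow: should block deployment
    "source_coverage": 0.85,
})
```

A CI job would print `failures` and call `sys.exit(exit_code)`, so any non-zero result blocks the deploy step.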

Scheduled Benchmarking

Frequency	Scope	Purpose
Daily	Critical queries (10)	Catch major regressions
Weekly	Full query set (50+)	Track trends
Monthly	Extended set (100+)	Comprehensive analysis

Trend Analysis

Track KPIs over time to identify:

Gradual improvements or degradations
Impact of document additions
Seasonal patterns in query types

Accuracy Trend (Last 4 Weeks):
Week 1: 89%
Week 2: 91% (+2%)
Week 3: 92% (+1%)
Week 4: 94% (+2%)

Trend: Improving ✓
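
The week-over-week changes and the overall trend label can be derived from the stored KPI history. This is a minimal sketch using the accuracy figures above; comparing the first and last values is one simple way to label a trend, and the `tolerance` parameter is an assumption for ignoring small fluctuations.

```python
def weekly_deltas(values):
    """Change between each consecutive pair of weekly values."""
    return [b - a for a, b in zip(values, values[1:])]


def trend(values, tolerance=0):
    """Label the net change from the first to the last week."""
    net = values[-1] - values[0]
    if net > tolerance:
        return "Improving"
    if net < -tolerance:
        return "Degrading"
    return "Stable"


accuracy_by_week = [89, 91, 92, 94]  # percent, Weeks 1-4
deltas = weekly_deltas(accuracy_by_week)
label = trend(accuracy_by_week)
```

For noisier KPIs, a moving average or a fitted slope is likely to be more robust than comparing two endpoints.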

Continuous Improvement

Benchmarking is not a one-time activity. Make it part of your regular RAG development workflow:

Weekly

Review benchmark trends
Identify queries needing attention
Plan improvements

Per Change

Run benchmark before deploying
Compare against baseline
Document results

Monthly

Add new queries based on user feedback
Review and update KPI targets
Comprehensive analysis

Quarterly

Full benchmark review
Strategic planning
Goal setting for next quarter

Next Steps

After establishing benchmarking:

Deploy to Production with confidence
Monitor Continuously using established KPIs
Iterate Regularly based on benchmark results
Expand Benchmark as new use cases emerge

Tips for Success

Start Small: Begin with 10-20 critical queries
Automate: Run benchmarks automatically on changes
Visualize: Use charts for trend analysis
Share: Make benchmarks visible to team
Act: Use results to drive improvements
