User Workflow

Step 4 - Benchmark & Iterate

Measure and improve your RAG system's performance using Evaluation Studio's clone-and-compare workflow.

Purpose

After deploying your RAG system, continuous improvement is key. Evaluation Studio provides 8 comprehensive metrics to measure performance and compare different configurations side-by-side. This step shows you how to clone your project, make changes, and prove they work better.

Entry Point: Dashboard → Evaluation Studio

Prerequisites: At least one deployed RAG project with processed documents

Expected Outcome: Data-driven decisions about configuration changes

Why Benchmark After Deployment?

Real-world usage reveals performance issues that initial testing misses:

Real user queries differ from test queries - Users ask questions in unexpected ways
Performance degrades as documents grow - More data can impact retrieval quality
Configuration changes need validation - Prove improvements before committing
A/B testing requires data, not guesses - Compare configurations objectively

Common Iteration Scenarios

Scenario	What You Change	What You Measure
Try different embedding model	Embedding model	Precision@K, Context Recall
Adjust chunking strategy	Chunk size, overlap	Answer Correctness, Faithfulness
Enable BM25 hybrid search	BM25 toggle	Precision@K, MRR
Tune Top-K retrieval	Top-K value	Answer Relevancy, Latency
Modify graph settings (GraphRAG)	Graph extraction params	Retrieval Accuracy

The Clone-and-Compare Workflow

The primary way to test configuration improvements is the clone-and-compare pattern: clone your existing project, make one change, evaluate both, and compare results.

Step 1: Clone Your Existing Project

Navigate to RAG Projects page
Find your deployed project
Click the "Clone" button
Give the cloned project a descriptive name (e.g., "My Project - BM25 Test")
The cloned project inherits all documents and base configuration

Step 2: Make Configuration Changes

Open the cloned project
Navigate to Pipeline Configuration
Make ONE change at a time (e.g., enable BM25)
Save configuration
Re-process documents if needed

Important: Changing only one variable at a time isolates the impact of each modification. Changing multiple settings makes it impossible to determine which change caused the improvement or degradation.
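One way to enforce the one-change rule is to diff the two configurations before running any evaluation. The sketch below assumes each project's pipeline configuration can be represented as a flat dictionary. The key names (embedding_model, chunk_size, bm25_enabled, and so on) and the changed_keys helper are hypothetical and are not taken from the product.

```python
def changed_keys(original: dict, clone: dict) -> list[str]:
    """Return the sorted setting names whose values differ between two configs."""
    all_keys = set(original) | set(clone)
    missing = object()  # sentinel so a key present in only one config counts as a change
    return sorted(
        name for name in all_keys
        if original.get(name, missing) != clone.get(name, missing)
    )


# Hypothetical configurations; key names and values are illustrative only.
original_config = {
    "embedding_model": "model-a",
    "chunk_size": 512,
    "chunk_overlap": 50,
    "bm25_enabled": False,
    "top_k": 5,
}
clone_config = dict(original_config, bm25_enabled=True)

diff = changed_keys(original_config, clone_config)
if len(diff) != 1:
    raise ValueError(f"Expected exactly one change, found {len(diff)}: {diff}")
print(f"Isolated change: {diff[0]}")
```

If the check fails, revert the extra settings in the clone before evaluating, so that any metric movement can be attributed to a single setting.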

Step 3: Create Evaluation Test Set

Navigate to Dashboard → Evaluation Studio
Click "New Evaluation"
Select your original project
Generate test questions (20-50 recommended) or import your own
Add ground truth answers for each question
Check "Save as Template" for reuse

Step 4: Run Evaluation on Original Project

Run the evaluation against the original project
Wait for completion (status changes from "running" to "completed")
Note the aggregate metrics as your baseline

Step 5: Run Same Evaluation on Cloned Project

Create a new evaluation
Select the cloned project
Use the same test questions (from your saved template)
Run evaluation
Wait for completion

Step 6: Compare Results

In Evaluation Studio, click the "Compare" button
Select both evaluations
The system validates that they share documents and questions
View side-by-side comparison with metric differences
Determine which configuration wins

Understanding Comparison Results

Aggregate Metrics Comparison

The comparison view presents a table showing each metric for both evaluations:

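Difference column - Positive values mean Evaluation B is better for the score metrics. Latency is the exception: lower is better, so a positive latency difference favors Evaluation A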
Winner indicator - Shows which evaluation won each individual metric
Overall winner - Determined by which evaluation wins the most metrics
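The logic described above can be sketched in a few lines. This is an illustration rather than Evaluation Studio's actual implementation: the compare function, the tie handling, and the sample values are assumptions. The difference is taken as Evaluation B minus Evaluation A, and Latency is treated as lower-is-better.

```python
# Metrics where a lower value is better; everything else is higher-is-better.
LOWER_IS_BETTER = {"Latency"}


def compare(eval_a: dict[str, float], eval_b: dict[str, float]) -> dict:
    """Compute per-metric difference (B - A), per-metric winner, and overall winner."""
    rows = {}
    wins = {"A": 0, "B": 0}
    for metric in eval_a:
        diff = eval_b[metric] - eval_a[metric]
        if diff == 0:
            winner = "Tie"  # assumption: ties count for neither side
        else:
            b_is_better = diff < 0 if metric in LOWER_IS_BETTER else diff > 0
            winner = "B" if b_is_better else "A"
            wins[winner] += 1
        rows[metric] = {"difference": round(diff, 4), "winner": winner}
    if wins["A"] == wins["B"]:
        overall = "Tie"
    else:
        overall = "A" if wins["A"] > wins["B"] else "B"
    return {"rows": rows, "wins": wins, "overall": overall}


# Illustrative values only, not real evaluation output.
result = compare(
    {"Precision@K": 0.70, "MRR": 0.60, "Latency": 1000},
    {"Precision@K": 0.75, "MRR": 0.58, "Latency": 1100},
)
for metric, row in result["rows"].items():
    print(f"{metric:<12} {row['difference']:+10.2f}  winner: {row['winner']}")
print("Overall winner:", result["overall"])
```

Counting wins treats every metric as equally important. A large regression on one metric can outweigh several small gains, so read the individual differences as well as the overall winner.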

Per-Question Comparison

Expand any question to see:

Side-by-side generated answers from each configuration
Per-question metric differences
Retrieved chunks comparison
Highlights where one configuration significantly outperforms the other

Validation Checks

Before comparison, the system verifies:

Both evaluations are completed
Evaluations share overlapping documents
Evaluations have matching questions (fuzzy matching at 85% similarity)
The validation result also shows the count of shared documents and matching questions
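To get a feel for what an 85% similarity threshold accepts, the sketch below pairs questions using Python's standard difflib.SequenceMatcher. The similarity measure Evaluation Studio actually uses is not documented here, so the choice of SequenceMatcher, the normalization step, and the greedy pairing are all assumptions.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # the 85% threshold stated above


def similarity(a: str, b: str) -> float:
    """Case- and whitespace-insensitive similarity ratio between 0.0 and 1.0."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def match_questions(questions_a: list[str], questions_b: list[str]) -> list[tuple[str, str]]:
    """Pair each question in A with its best unused match in B at or above the threshold."""
    unused = list(questions_b)
    pairs = []
    for qa in questions_a:
        if not unused:
            break
        best = max(unused, key=lambda qb: similarity(qa, qb))
        if similarity(qa, best) >= SIMILARITY_THRESHOLD:
            pairs.append((qa, best))
            unused.remove(best)
    return pairs


# Illustrative question sets.
set_a = ["How do I reset my password?", "What is the refund policy?"]
set_b = ["How do I reset my password", "Which plans include SSO?"]

pairs = match_questions(set_a, set_b)
print(f"{len(pairs)} matching question(s) out of {len(set_a)}")
```

Small differences such as punctuation or capitalization stay above the threshold, while reworded questions usually do not. Running both evaluations from the same saved template avoids the issue entirely.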

Key Metrics Overview

Metric	What It Measures	Good Value	How to Improve
Precision@K	Relevance of top-K chunks	>0.7	Better embedding, smaller chunks
Context Recall	Ground truth chunk found	>0.8	Increase limit, lower threshold
MRR	First relevant result ranking	>0.6	Better ranking, query expansion
Faithfulness	Answer supported by context	>0.8	Better context assembly
Answer Relevancy	Answer addresses question	>0.7	Better prompt, temperature tuning
Answer Correctness	Accuracy vs ground truth	>0.7	Better retrieval overall
Retrieval Accuracy	Overall retrieval effectiveness	>0.7	Multiple factors
Latency	Response time (ms)	<2000ms	Lower Top-K, disable BM25

For a deep dive into each metric, see the Evaluation Studio documentation.
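As a quick orientation, Precision@K and MRR have standard textbook definitions, sketched below. The exact formulas Evaluation Studio applies may differ in detail (for example, how it handles fewer than K retrieved chunks), and the chunk IDs are made up for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    # Divides by k even if fewer than k chunks were retrieved (a common convention).
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one pair per question."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Two illustrative questions with hypothetical chunk IDs.
runs = [
    (["c3", "c1", "c7", "c2", "c9"], {"c1", "c2"}),  # first relevant chunk at rank 2
    (["c4", "c5"], {"c4"}),                          # first relevant chunk at rank 1
]
print("Precision@5:", precision_at_k(*runs[0], k=5))  # 2 relevant of 5 -> 0.4
print("MRR:", mean_reciprocal_rank(runs))             # (0.5 + 1.0) / 2 -> 0.75
```

Precision@K rewards filling the top K with relevant chunks, while MRR only cares how early the first relevant chunk appears. That is why a change such as BM25 can move the two metrics by different amounts.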

Best Practices

Change One Thing at a Time

Isolate the impact of each configuration change. If you change the embedding model AND chunk size simultaneously, you cannot determine which change caused the result.

Use 20-50 Test Questions

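Enough to reveal clear trends. As a rough guide, 50 questions are typically enough to detect differences of around 10% with 95% confidence, although the exact threshold depends on how much scores vary from question to question.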
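Whether a given difference is detectable depends on how consistent the per-question scores are. If you have the per-question scores for both evaluations, a paired confidence interval gives a direct check. The sketch below uses a normal approximation (z = 1.96 for 95%), which is reasonable at around 50 questions. The function name and the synthetic scores are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_difference_ci(scores_a: list[float], scores_b: list[float], z: float = 1.96):
    """Mean per-question difference (B - A) with a normal-approximation interval."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("Need two equal-length score lists with at least 2 questions")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    centre = mean(diffs)
    margin = z * stdev(diffs) / sqrt(len(diffs))
    return centre, centre - margin, centre + margin


# Synthetic per-question scores for 50 questions (illustrative, not real results).
scores_original = [0.6, 0.7, 0.8, 0.5, 0.9] * 10
scores_clone = [0.7, 0.7, 0.9, 0.7, 0.9] * 10

centre, low, high = paired_difference_ci(scores_original, scores_clone)
print(f"Mean difference: {centre:+.3f} (95% CI {low:+.3f} to {high:+.3f})")
print("Interval excludes zero:", low > 0 or high < 0)
```

If the interval includes zero, the test set is too small or the scores too noisy to separate the two configurations. Add more questions before drawing a conclusion. For much smaller test sets, a t-distribution critical value is more appropriate than 1.96.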

Include Diverse Query Types

Factual questions - Direct lookups
Multi-part questions - Require synthesis
Domain-specific queries - Industry terminology
Edge cases - Ambiguous or rare questions

Save Evaluations as Templates

Reuse the same questions for consistent comparison. Templates ensure apples-to-apples comparisons across iterations.

Track Improvements Over Time

Build a performance history by naming evaluations clearly:

"Baseline - 2024-11 - Default Config"
"Iteration 1 - 2024-11 - BM25 Enabled"
"Iteration 2 - 2024-12 - Chunk Size 512"

Test with Real User Queries

The most representative test data comes from actual production queries. Collect user questions and add them to your test set.

Example Comparison Session

Scenario: Testing BM25 Hybrid Search

Original Project: "Customer Support RAG"
Cloned Project: "Customer Support RAG - BM25 Test"
Change Made: Enabled BM25 hybrid search
Test Questions: 30 (mix of factual, technical, and edge cases)

Comparison Results

┌─────────────────────┬──────────┬──────────┬──────────┐
│ Metric              │ Original │ BM25     │ Winner   │
├─────────────────────┼──────────┼──────────┼──────────┤
│ Precision@K         │    0.72  │   0.78   │  BM25    │
│ Context Recall      │    0.81  │   0.85   │  BM25    │
│ MRR                 │    0.65  │   0.71   │  BM25    │
│ Faithfulness        │    0.83  │   0.82   │ Original │
│ Answer Relevancy    │    0.76  │   0.79   │  BM25    │
│ Answer Correctness  │    0.74  │   0.78   │  BM25    │
│ Retrieval Accuracy  │    0.73  │   0.77   │  BM25    │
│ Latency (ms)        │   1200   │   1450   │ Original │
└─────────────────────┴──────────┴──────────┴──────────┘

Overall Winner: BM25 (6 vs 2 metrics)

Decision

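Enable BM25. The 250 ms latency increase (about 21%) is acceptable in exchange for absolute gains of 0.03 to 0.06 (roughly 4-9% relative) on six of the eight metrics. Faithfulness is essentially unchanged (0.83 vs 0.82). The trade-off favors accuracy over speed for this customer support use case.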
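The size of each change can be recomputed directly from the table values. In this minimal sketch the dictionaries simply restate the table above, and the relative_change helper is illustrative.

```python
# Aggregate values copied from the comparison table above.
original = {
    "Precision@K": 0.72, "Context Recall": 0.81, "MRR": 0.65,
    "Faithfulness": 0.83, "Answer Relevancy": 0.76,
    "Answer Correctness": 0.74, "Retrieval Accuracy": 0.73,
    "Latency (ms)": 1200,
}
bm25 = {
    "Precision@K": 0.78, "Context Recall": 0.85, "MRR": 0.71,
    "Faithfulness": 0.82, "Answer Relevancy": 0.79,
    "Answer Correctness": 0.78, "Retrieval Accuracy": 0.77,
    "Latency (ms)": 1450,
}


def relative_change(before: float, after: float) -> float:
    """Percentage change from before to after."""
    return (after - before) / before * 100


print(f"{'Metric':<20} {'Absolute':>10} {'Relative':>10}")
for metric, before in original.items():
    after = bm25[metric]
    print(f"{metric:<20} {after - before:>+10.2f} {relative_change(before, after):>+9.1f}%")
```

Looking at absolute and relative changes side by side makes the trade-off explicit: the score metrics move by a few hundredths, while latency moves by hundreds of milliseconds.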

FAQ

Can I compare evaluations from different projects?

Yes, if they share overlapping documents. The comparison validation checks for shared documents and matching questions (fuzzy matching at 85% similarity threshold).

How many questions do I need?

Minimum: 20 questions for basic trends
Recommended: 50 questions for reliable metrics
Comprehensive: 100+ questions for thorough benchmarking

How long does an evaluation take?

Depends on question count and LLM speed:

10 questions: ~2-3 minutes
50 questions: ~10-15 minutes
100 questions: ~20-30 minutes

Evaluations run asynchronously - you can navigate away and check results later.
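The figures above work out to roughly 12-18 seconds per question, so you can estimate other test-set sizes by scaling linearly. The helper below is a back-of-the-envelope sketch. Linear scaling and the per-question range are inferred from the listed figures, not guaranteed by the product.

```python
def estimate_minutes(num_questions: int,
                     seconds_per_question: tuple[float, float] = (12, 18)) -> tuple[float, float]:
    """Return a (low, high) runtime estimate in minutes, assuming linear scaling."""
    low, high = seconds_per_question
    return num_questions * low / 60, num_questions * high / 60


for n in (10, 30, 50, 100):
    low, high = estimate_minutes(n)
    print(f"{n:>3} questions: ~{low:.0f}-{high:.0f} minutes")
```

Actual runtimes depend on LLM speed, so treat the result as a planning figure only.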

Can I export comparison results?

Yes. Click the Export button on any completed evaluation to download results in CSV or JSON format. Exports include all aggregate metrics, per-question results, generated answers, and configuration details.

What if questions don't match exactly?

The comparison uses fuzzy matching at 85% similarity, so minor wording differences are handled automatically. Questions that fall below the 85% threshold are not treated as matches. To guarantee identical questions, run both evaluations from the same saved template.

Next Steps

Continue Iterating

Use the insights from your comparison to make the next configuration change. The benchmark-and-iterate cycle is continuous:

Establish baseline
Make one change
Evaluate and compare
Decide and repeat

Monitor Production Performance

After adopting a new configuration, monitor real-world metrics to validate that benchmark improvements translate to production results.

What to Track

Baseline metrics from your original evaluation
Configuration changes made in each iteration
Comparison results with clear winner determination
Production feedback from actual users

Tips for Success

Name evaluations clearly - Include date and configuration details
Use templates - Ensures consistent test sets across comparisons
Document decisions - Record why you adopted or rejected each change
Test incrementally - Small changes are easier to validate
Consider trade-offs - Higher accuracy may mean higher latency
