
Measure and improve your RAG system performance using Evaluation Studio's clone-and-compare workflow
After deploying your RAG system, continuous improvement is key. Evaluation Studio provides 8 comprehensive metrics to measure performance and compare different configurations side-by-side. This step shows you how to clone your project, make changes, and prove they work better.
Entry Point: Dashboard → Evaluation Studio
Prerequisites: At least one deployed RAG project with processed documents
Expected Outcome: Data-driven decisions about configuration changes
Real-world usage reveals performance issues that initial testing misses:
| Scenario | What You Change | What You Measure |
|---|---|---|
| Try different embedding model | Embedding model | Precision@K, Context Recall |
| Adjust chunking strategy | Chunk size, overlap | Answer Correctness, Faithfulness |
| Enable BM25 hybrid search | BM25 toggle | Precision@K, MRR |
| Tune Top-K retrieval | Top-K value | Answer Relevancy, Latency |
| Modify graph settings (GraphRAG) | Graph extraction params | Retrieval Accuracy |
The primary way to test configuration improvements is the clone-and-compare pattern: clone your existing project, make one change, evaluate both, and compare results.
Important: Changing only one variable at a time isolates the impact of each modification. Changing multiple settings makes it impossible to determine which change caused the improvement or degradation.
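The one-variable rule can be checked mechanically before an evaluation is run. The sketch below is a minimal illustration, assuming each project's configuration can be written down as a flat dictionary; the setting names (`embedding_model`, `chunk_size`, `bm25`, and so on) are hypothetical and not Evaluation Studio's actual field names.

```python
from typing import Any


def changed_settings(original: dict[str, Any], clone: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (original_value, clone_value)} for every setting that differs."""
    keys = original.keys() | clone.keys()
    return {
        key: (original.get(key), clone.get(key))
        for key in sorted(keys)
        if original.get(key) != clone.get(key)
    }


def is_single_variable_change(original: dict[str, Any], clone: dict[str, Any]) -> bool:
    """True when exactly one setting differs between the two configurations."""
    return len(changed_settings(original, clone)) == 1


# Hypothetical settings, used for illustration only.
baseline = {
    "embedding_model": "model-a",
    "chunk_size": 512,
    "chunk_overlap": 50,
    "bm25": False,
    "top_k": 5,
}
candidate = {**baseline, "bm25": True}

print(changed_settings(baseline, candidate))
print(is_single_variable_change(baseline, candidate))
```

If the check reports more than one changed setting, split the experiment into separate clones so that each comparison isolates a single change.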
The comparison view presents a table showing each metric for both evaluations:
Expand any question to see:
Before comparison, the system verifies:
| Metric | What It Measures | Good Value | How to Improve |
|---|---|---|---|
| Precision@K | Relevance of top-K chunks | >0.7 | Better embedding, smaller chunks |
| Context Recall | Ground truth chunk found | >0.8 | Increase limit, lower threshold |
| MRR | First relevant result ranking | >0.6 | Better ranking, query expansion |
| Faithfulness | Answer supported by context | >0.8 | Better context assembly |
| Answer Relevancy | Answer addresses question | >0.7 | Better prompt, temperature tuning |
| Answer Correctness | Accuracy vs ground truth | >0.7 | Better retrieval overall |
| Retrieval Accuracy | Overall retrieval effectiveness | >0.7 | Multiple factors |
| Latency | Response time (ms) | <2000ms | Lower Top-K, disable BM25 |
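
The "Good Value" column can be applied to a set of results to flag which metrics need attention. This is a minimal sketch: the thresholds are copied from the table above, while the function name and the sample scores are hypothetical.

```python
# (threshold, higher_is_better) pairs taken from the "Good Value" column.
GOOD_VALUES = {
    "Precision@K": (0.7, True),
    "Context Recall": (0.8, True),
    "MRR": (0.6, True),
    "Faithfulness": (0.8, True),
    "Answer Relevancy": (0.7, True),
    "Answer Correctness": (0.7, True),
    "Retrieval Accuracy": (0.7, True),
    "Latency": (2000, False),  # milliseconds, lower is better
}


def below_target(results: dict[str, float]) -> list[str]:
    """Return the metrics in `results` that miss their 'Good Value' threshold."""
    flagged = []
    for metric, value in results.items():
        threshold, higher_is_better = GOOD_VALUES[metric]
        meets_target = value > threshold if higher_is_better else value < threshold
        if not meets_target:
            flagged.append(metric)
    return flagged


# Hypothetical scores, for illustration only.
sample = {"Precision@K": 0.66, "Context Recall": 0.84, "Latency": 2300}
print(below_target(sample))
```

Each flagged metric maps back to the "How to Improve" column, which suggests the next setting to change.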
For a deep dive into each metric, see the Evaluation Studio documentation.
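As a rough guide to what the retrieval metrics measure, the sketch below implements the textbook definitions of Precision@K and MRR (mean reciprocal rank). It is an illustration under assumptions: Evaluation Studio's exact formulas are not given here and may differ, and the chunk IDs are made up.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per question."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical chunk IDs for two test questions.
q1 = (["c3", "c7", "c1", "c9"], {"c1", "c7"})
q2 = (["c2", "c5"], {"c2"})

print(precision_at_k(*q1, k=4))        # 2 relevant chunks in the top 4 -> 0.5
print(mean_reciprocal_rank([q1, q2]))  # (1/2 + 1/1) / 2 -> 0.75
```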
Isolate the impact of each configuration change. If you change the embedding model AND chunk size simultaneously, you cannot determine which change caused the result.
Use enough questions for statistical significance. With about 50 questions, you can detect differences of roughly 10% at 95% confidence.
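The source does not say how the ~10% figure is derived. One way to sanity-check it is the normal-approximation half-width of a 95% confidence interval for the difference between two mean scores. The sketch below assumes independent samples and a per-question score standard deviation of 0.25; both are assumptions made for illustration, not values from Evaluation Studio.

```python
import math


def detectable_difference(n_questions: int, score_std: float, z: float = 1.96) -> float:
    """Half-width of the CI for a difference of two means (z = 1.96 for 95%).

    Assumes two independent samples of n_questions each with the same
    per-question standard deviation (normal approximation).
    """
    return z * score_std * math.sqrt(2.0 / n_questions)


# Assumed per-question standard deviation of 0.25 on a 0-1 metric.
print(round(detectable_difference(50, 0.25), 3))  # 0.098, i.e. roughly 10 points
```

Because both evaluations reuse the same questions, a paired analysis of per-question differences often gives a tighter interval than this independent-samples estimate.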
Reuse the same questions for consistent comparison. Templates ensure apples-to-apples comparisons across iterations.
Build a performance history by naming evaluations clearly:
"Baseline - 2024-11 - Default Config"
"Iteration 1 - 2024-11 - BM25 Enabled"
"Iteration 2 - 2024-12 - Chunk Size 512"
The most representative test data comes from actual production queries. Collect user questions and add them to your test set.
Original Project: "Customer Support RAG"
Cloned Project: "Customer Support RAG - BM25 Test"
Change Made: Enabled BM25 hybrid search
Test Questions: 30 (mix of factual, technical, and edge cases)
| Metric | Original | BM25 | Winner |
|---|---|---|---|
| Precision@K | 0.72 | 0.78 | BM25 |
| Context Recall | 0.81 | 0.85 | BM25 |
| MRR | 0.65 | 0.71 | BM25 |
| Faithfulness | 0.83 | 0.82 | Original |
| Answer Relevancy | 0.76 | 0.79 | BM25 |
| Answer Correctness | 0.74 | 0.78 | BM25 |
| Retrieval Accuracy | 0.73 | 0.77 | BM25 |
| Latency (ms) | 1200 | 1450 | Original |
Overall Winner: BM25 (6 vs 2 metrics)
Enable BM25. The 250 ms latency increase (1200 ms to 1450 ms) is acceptable in exchange for relative gains of roughly 4-9% on six of the eight metrics. The trade-off favors accuracy over speed for this customer support use case.
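The winner tally and the size of each change can be reproduced from the numbers in the comparison table. This is a minimal sketch: the scores are copied from the example, while the function names and the rule that only latency is lower-is-better are assumptions about how a winner is decided.

```python
from collections import Counter

# (original, bm25) scores copied from the comparison table in this example.
RESULTS = {
    "Precision@K": (0.72, 0.78),
    "Context Recall": (0.81, 0.85),
    "MRR": (0.65, 0.71),
    "Faithfulness": (0.83, 0.82),
    "Answer Relevancy": (0.76, 0.79),
    "Answer Correctness": (0.74, 0.78),
    "Retrieval Accuracy": (0.73, 0.77),
    "Latency (ms)": (1200, 1450),
}
LOWER_IS_BETTER = {"Latency (ms)"}


def winner(metric: str, original: float, candidate: float) -> str:
    """Name the better configuration for one metric."""
    if original == candidate:
        return "Tie"
    if metric in LOWER_IS_BETTER:
        candidate_better = candidate < original
    else:
        candidate_better = candidate > original
    return "BM25" if candidate_better else "Original"


def relative_change(original: float, candidate: float) -> float:
    """Change relative to the original value (0.05 means +5%)."""
    return (candidate - original) / original


wins = Counter(winner(m, o, c) for m, (o, c) in RESULTS.items())
print(dict(wins))  # {'BM25': 6, 'Original': 2}

for metric, (o, c) in RESULTS.items():
    print(f"{metric:<20} {relative_change(o, c):+.1%}")
```

A simple win count treats every metric as equally important. For a latency-sensitive application, the same numbers could reasonably lead to the opposite decision.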
Yes, if they share overlapping documents. The comparison validation checks for shared documents and matching questions (fuzzy matching at 85% similarity threshold).
Depends on question count and LLM speed:
Evaluations run asynchronously, so you can navigate away and check results later.
Yes. Click the Export button on any completed evaluation to download results in CSV or JSON format. Exports include all aggregate metrics, per-question results, generated answers, and configuration details.
The comparison uses fuzzy matching at 85% similarity. Minor wording differences are handled automatically. If questions differ by more than 15%, use templates to ensure exact matches.
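To get a feel for what an 85% similarity threshold tolerates, the sketch below compares question strings with Python's standard-library `difflib.SequenceMatcher`. The algorithm Evaluation Studio actually uses is not documented here, so treat this as an approximation; the function names and sample questions are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def questions_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when two questions are similar enough to be treated as the same."""
    return similarity(a, b) >= threshold


# Minor punctuation difference: treated as the same question.
print(questions_match("How do I reset my password?", "How do I reset my password"))

# Different question entirely: not matched.
print(questions_match("How do I reset my password?", "What is the refund policy?"))
```

Running a check like this on both question sets before evaluating can show which pairs are at risk of falling below the threshold.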
Use the insights from your comparison to make the next configuration change. The benchmark-and-iterate cycle is continuous:
After adopting a new configuration, monitor real-world metrics to validate that benchmark improvements translate to production results.