User Workflow

Step 4 - Benchmark & Iterate

Measure and improve your RAG system performance using Evaluation Studio's clone-and-compare workflow

Purpose

After you deploy your RAG system, continuous improvement is key. Evaluation Studio provides eight metrics to measure performance and compare configurations side by side. This step shows you how to clone your project, change its configuration, and verify that the change improves results.

Entry Point: Dashboard → Evaluation Studio

Prerequisites: At least one deployed RAG project with processed documents

Expected Outcome: Data-driven decisions about configuration changes

Why Benchmark After Deployment?

Real-world usage reveals performance issues that initial testing misses:

  • Real user queries differ from test queries - Users ask questions in unexpected ways
  • Performance degrades as documents grow - More data can impact retrieval quality
  • Configuration changes need validation - Prove improvements before committing
  • A/B testing requires data, not guesses - Compare configurations objectively

Common Iteration Scenarios

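┌──────────────────────────────────┬─────────────────────────┬──────────────────────────────────┐
│ Scenario                         │ What You Change         │ What You Measure                 │
├──────────────────────────────────┼─────────────────────────┼──────────────────────────────────┤
│ Try different embedding model    │ Embedding model         │ Precision@K, Context Recall      │
│ Adjust chunking strategy         │ Chunk size, overlap     │ Answer Correctness, Faithfulness │
│ Enable BM25 hybrid search        │ BM25 toggle             │ Precision@K, MRR                 │
│ Tune Top-K retrieval             │ Top-K value             │ Answer Relevancy, Latency        │
│ Modify graph settings (GraphRAG) │ Graph extraction params │ Retrieval Accuracy               │
└──────────────────────────────────┴─────────────────────────┴──────────────────────────────────┘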

The Clone-and-Compare Workflow

The primary way to test configuration improvements is the clone-and-compare pattern: clone your existing project, make one change, evaluate both, and compare results.

Step 1: Clone Your Existing Project

  1. Navigate to the RAG Projects page
  2. Find your deployed project
  3. Click the "Clone" button
  4. Give the cloned project a descriptive name (e.g., "My Project - BM25 Test")
  5. The cloned project inherits all documents and base configuration

Step 2: Make Configuration Changes

  1. Open the cloned project
  2. Navigate to Pipeline Configuration
  3. Make ONE change at a time (e.g., enable BM25)
  4. Save configuration
  5. Re-process documents if needed

Important: Changing only one variable at a time isolates the impact of each modification. Changing multiple settings makes it impossible to determine which change caused the improvement or degradation.
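The one-change rule can also be checked mechanically before you run an evaluation. The sketch below is illustrative only: it assumes you have copied both configurations into plain dictionaries by hand, and the setting names (`embedding_model`, `chunk_size`, `bm25_enabled`, `top_k`) are hypothetical, not the product's actual field names.

```python
from typing import Any


def changed_settings(original: dict[str, Any], clone: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (original_value, clone_value)} for every setting that differs."""
    missing = object()  # sentinel for settings present in only one configuration
    diffs = {}
    for key in sorted(original.keys() | clone.keys()):
        before = original.get(key, missing)
        after = clone.get(key, missing)
        if before != after:
            diffs[key] = (
                None if before is missing else before,
                None if after is missing else after,
            )
    return diffs


# Hypothetical settings; real names depend on your pipeline configuration.
original_config = {"embedding_model": "model-a", "chunk_size": 512, "bm25_enabled": False, "top_k": 5}
clone_config = {**original_config, "bm25_enabled": True}

diffs = changed_settings(original_config, clone_config)
if len(diffs) != 1:
    raise ValueError(f"Expected exactly one changed setting, found {len(diffs)}: {diffs}")
print(diffs)  # {'bm25_enabled': (False, True)}
```

If the check reports more than one difference, revert the extra settings in the clone before evaluating.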

Step 3: Create Evaluation Test Set

  1. Navigate to Dashboard → Evaluation Studio
  2. Click "New Evaluation"
  3. Select your original project
  4. Generate test questions (20-50 recommended) or import your own
  5. Add ground truth answers for each question
  6. Check "Save as Template" for reuse

Step 4: Run Evaluation on Original Project

  1. Run the evaluation against the original project
  2. Wait for completion (status changes from "running" to "completed")
  3. Note the aggregate metrics as your baseline

Step 5: Run Same Evaluation on Cloned Project

  1. Create a new evaluation
  2. Select the cloned project
  3. Use the same test questions (from your saved template)
  4. Run evaluation
  5. Wait for completion

Step 6: Compare Results

  1. In Evaluation Studio, click the "Compare" button
  2. Select both evaluations
  3. The system validates that both evaluations share documents and questions
  4. View side-by-side comparison with metric differences
  5. Determine which configuration wins

Understanding Comparison Results

Aggregate Metrics Comparison

The comparison view presents a table showing each metric for both evaluations:

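  • Difference column - Evaluation B minus Evaluation A. Positive values mean Evaluation B is better for metrics where higher is better; for Latency, lower is better, so a positive difference means Evaluation B is slower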
  • Winner indicator - Shows which evaluation won each individual metric
  • Overall winner - Determined by which evaluation wins the most metrics
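The winner logic described above can be sketched in a few lines. This is an approximation under stated assumptions, not Evaluation Studio's actual implementation: the metric keys and sample values are hypothetical, latency is assumed to be the only lower-is-better metric, and the tie handling is a guess.

```python
# Assumption: latency is the only metric where a lower value is better.
LOWER_IS_BETTER = {"latency_ms"}


def compare(eval_a: dict[str, float], eval_b: dict[str, float]) -> dict:
    """Compute per-metric differences (B - A), per-metric winners, and an overall winner."""
    rows = {}
    wins = {"A": 0, "B": 0, "tie": 0}
    for metric, value_a in eval_a.items():
        diff = eval_b[metric] - value_a
        if diff == 0:
            winner = "tie"
        else:
            b_is_better = diff < 0 if metric in LOWER_IS_BETTER else diff > 0
            winner = "B" if b_is_better else "A"
        rows[metric] = {"difference": round(diff, 4), "winner": winner}
        wins[winner] += 1
    # The overall winner is whichever evaluation wins the most metrics.
    if wins["A"] == wins["B"]:
        overall = "tie"
    else:
        overall = "A" if wins["A"] > wins["B"] else "B"
    return {"metrics": rows, "wins": wins, "overall": overall}


# Hypothetical aggregate metrics for two evaluations.
eval_a = {"precision_at_k": 0.70, "faithfulness": 0.80, "latency_ms": 1000}
eval_b = {"precision_at_k": 0.75, "faithfulness": 0.78, "latency_ms": 900}

result = compare(eval_a, eval_b)
print(result["wins"], result["overall"])
```

Counting wins weights every metric equally, so a narrow win on many metrics can outvote a large regression on one. Check the difference column before accepting the overall winner.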

Per-Question Comparison

Expand any question to see:

  • Side-by-side generated answers from each configuration
  • Per-question metric differences
  • Retrieved chunks comparison
  • Highlights where one configuration significantly outperforms the other

Validation Checks

Before comparison, the system verifies:

  • Both evaluations are completed
  • Evaluations share overlapping documents
  • Evaluations have matching questions (fuzzy matching at 85% similarity)
  • The validation result also reports the number of shared documents and matching questions
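To get a feel for what an 85% similarity threshold accepts and rejects, the sketch below matches two question lists with Python's standard `difflib.SequenceMatcher`. The similarity algorithm Evaluation Studio actually uses is not documented here, so treat the scoring method, the normalization step, and the greedy pairing as assumptions.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # the 85% threshold mentioned above


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(question.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized questions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_questions(questions_a: list[str], questions_b: list[str],
                    threshold: float = SIMILARITY_THRESHOLD) -> list[tuple[str, str]]:
    """Greedily pair each question in A with its most similar unused question in B."""
    matches = []
    unused = list(questions_b)
    for qa in questions_a:
        if not unused:
            break
        best = max(unused, key=lambda qb: similarity(qa, qb))
        if similarity(qa, best) >= threshold:
            matches.append((qa, best))
            unused.remove(best)
    return matches


questions_a = ["How do I reset my password?", "What is the refund policy?"]
questions_b = ["How long does shipping take?", "how do I reset my password"]

matches = match_questions(questions_a, questions_b)
print(f"{len(matches)} matching question(s): {matches}")
```

Here the two password questions differ only in case and punctuation, so they pair up, while the refund and shipping questions fall well below the threshold and stay unmatched.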

Key Metrics Overview

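┌────────────────────┬─────────────────────────────────┬────────────┬───────────────────────────────────┐
│ Metric             │ What It Measures                │ Good Value │ How to Improve                    │
├────────────────────┼─────────────────────────────────┼────────────┼───────────────────────────────────┤
│ Precision@K        │ Relevance of top-K chunks       │ >0.7       │ Better embedding, smaller chunks  │
│ Context Recall     │ Ground truth chunk found        │ >0.8       │ Increase limit, lower threshold   │
│ MRR                │ First relevant result ranking   │ >0.6       │ Better ranking, query expansion   │
│ Faithfulness       │ Answer supported by context     │ >0.8       │ Better context assembly           │
│ Answer Relevancy   │ Answer addresses question       │ >0.7       │ Better prompt, temperature tuning │
│ Answer Correctness │ Accuracy vs ground truth        │ >0.7       │ Better retrieval overall          │
│ Retrieval Accuracy │ Overall retrieval effectiveness │ >0.7       │ Multiple factors                  │
│ Latency            │ Response time (ms)              │ <2000 ms   │ Lower Top-K, disable BM25         │
└────────────────────┴─────────────────────────────────┴────────────┴───────────────────────────────────┘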

For a deep dive into each metric, see the Evaluation Studio documentation.
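For orientation, the two ranking metrics can be computed by hand from a list of retrieved chunk IDs and a set of relevant chunk IDs. The sketch below uses the textbook information-retrieval definitions of Precision@K and MRR; Evaluation Studio's exact formulas may differ, and the chunk IDs are made up for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical per-question results: (retrieved chunk IDs in rank order, relevant chunk IDs).
runs = [
    (["c1", "c7", "c3"], {"c1", "c3"}),  # first hit at rank 1, two of three relevant
    (["c9", "c2", "c4"], {"c2"}),        # first hit at rank 2, one of three relevant
    (["c5", "c6", "c8"], {"c0"}),        # no relevant chunk retrieved
]

mean_precision = sum(precision_at_k(r, rel, k=3) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(f"Precision@3 = {mean_precision:.2f}, MRR = {mrr:.2f}")
```

Precision@K rewards having many relevant chunks in the top K, while MRR only looks at how early the first relevant chunk appears, which is why the two can move in different directions after a configuration change.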

Best Practices

Change One Thing at a Time

Isolate the impact of each configuration change. If you change the embedding model AND chunk size simultaneously, you cannot determine which change caused the result.

Use 20-50 Test Questions

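A test set of 20-50 questions is usually enough to separate real differences from noise. As a rough guide, 50 questions let you detect differences of about 10% with 95% confidence; detecting smaller differences requires a larger test set.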
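The link between test-set size and detectable difference can be estimated with a normal-approximation confidence interval on the per-question score differences. This is a back-of-the-envelope sketch, not how Evaluation Studio reports significance; the standard deviation of 0.35 is an illustrative assumption, and for small test sets a t-distribution would be more accurate.

```python
import math
import statistics


def margin_of_error(std_dev: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a mean difference."""
    return z * std_dev / math.sqrt(n_questions)


def paired_difference_interval(scores_a: list[float], scores_b: list[float],
                               z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for the mean per-question difference (B - A)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("Need two equally long score lists with at least 2 questions")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    margin = margin_of_error(statistics.stdev(diffs), len(diffs), z)
    return mean - margin, mean + margin


# Assumed spread of per-question differences; measure it on your own data.
ASSUMED_STD_DEV = 0.35
for n in (20, 50, 100):
    print(f"{n:>3} questions: +/- {margin_of_error(ASSUMED_STD_DEV, n):.3f}")
```

Under that assumed spread, 50 questions give a margin of roughly ±0.10, 20 questions roughly ±0.15, and 100 questions roughly ±0.07. If the interval returned by `paired_difference_interval` includes zero, the observed difference may be noise.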

Include Diverse Query Types

  • Factual questions - Direct lookups
  • Multi-part questions - Require synthesis
  • Domain-specific queries - Industry terminology
  • Edge cases - Ambiguous or rare questions

Save Evaluations as Templates

Reuse the same questions for consistent comparison. Templates ensure apples-to-apples comparisons across iterations.

Track Improvements Over Time

Build a performance history by naming evaluations clearly:

"Baseline - 2024-11 - Default Config"
"Iteration 1 - 2024-11 - BM25 Enabled"
"Iteration 2 - 2024-12 - Chunk Size 512"

Test with Real User Queries

The most representative test data comes from actual production queries. Collect user questions and add them to your test set.

Example Comparison Session

  • Original Project: "Customer Support RAG"
  • Cloned Project: "Customer Support RAG - BM25 Test"
  • Change Made: Enabled BM25 hybrid search
  • Test Questions: 30 (mix of factual, technical, and edge cases)

Comparison Results

┌─────────────────────┬──────────┬──────────┬──────────┐
│ Metric              │ Original │ BM25     │ Winner   │
├─────────────────────┼──────────┼──────────┼──────────┤
│ Precision@K         │    0.72  │   0.78   │  BM25    │
│ Context Recall      │    0.81  │   0.85   │  BM25    │
│ MRR                 │    0.65  │   0.71   │  BM25    │
│ Faithfulness        │    0.83  │   0.82   │ Original │
│ Answer Relevancy    │    0.76  │   0.79   │  BM25    │
│ Answer Correctness  │    0.74  │   0.78   │  BM25    │
│ Retrieval Accuracy  │    0.73  │   0.77   │  BM25    │
│ Latency (ms)        │   1200   │   1450   │ Original │
└─────────────────────┴──────────┴──────────┴──────────┘

Overall Winner: BM25 (6 vs 2 metrics)

Decision

Enable BM25. It wins six of the eight metrics with absolute gains of 0.03-0.06 (roughly 4-9% relative), at the cost of a 250 ms (about 21%) latency increase and a 0.01 drop in Faithfulness. For this customer support use case, the trade-off favors accuracy over speed.
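When weighing a trade-off like this one, it helps to put every metric on the same scale by computing the change relative to the baseline. The sketch below does that for the figures in the comparison table above; the helper name `relative_change` is just an illustration.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100


# (Original, BM25) values copied from the comparison table above.
results = {
    "Precision@K": (0.72, 0.78),
    "Context Recall": (0.81, 0.85),
    "MRR": (0.65, 0.71),
    "Faithfulness": (0.83, 0.82),
    "Answer Relevancy": (0.76, 0.79),
    "Answer Correctness": (0.74, 0.78),
    "Retrieval Accuracy": (0.73, 0.77),
    "Latency (ms)": (1200, 1450),
}

for metric, (original, bm25) in results.items():
    absolute = bm25 - original
    relative = relative_change(original, bm25)
    print(f"{metric:<20} {absolute:>+8.2f} {relative:>+7.1f}%")
```

On these figures the six quality metrics that BM25 wins improve by roughly 4-9% in relative terms, Faithfulness dips by about 1%, and latency rises by about 21%. Whether that latency cost is acceptable depends on the response-time budget for your use case.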

FAQ

Can I compare evaluations from different projects?

Yes, if they share overlapping documents. The comparison validation checks for shared documents and matching questions (fuzzy matching at 85% similarity threshold).

How many questions do I need?

  • Minimum: 20 questions for basic trends
  • Recommended: 50 questions for reliable metrics
  • Comprehensive: 100+ questions for thorough benchmarking

How long does an evaluation take?

Evaluation time depends on the number of questions and on LLM speed:

  • 10 questions: ~2-3 minutes
  • 50 questions: ~10-15 minutes
  • 100 questions: ~20-30 minutes

Evaluations run asynchronously - you can navigate away and check results later.

Can I export comparison results?

Yes. Click the Export button on any completed evaluation to download results in CSV or JSON format. Exports include all aggregate metrics, per-question results, generated answers, and configuration details.

What if questions don't match exactly?

The comparison uses fuzzy matching with an 85% similarity threshold, so minor wording differences are handled automatically. Questions that fall below 85% similarity are not treated as matches. To guarantee exact matches, run both evaluations from the same saved template.

Next Steps

Continue Iterating

Use the insights from your comparison to make the next configuration change. The benchmark-and-iterate cycle is continuous:

  1. Establish baseline
  2. Make one change
  3. Evaluate and compare
  4. Decide and repeat

Monitor Production Performance

After adopting a new configuration, monitor real-world metrics to validate that benchmark improvements translate to production results.

What to Track

  1. Baseline metrics from your original evaluation
  2. Configuration changes made in each iteration
  3. Comparison results with clear winner determination
  4. Production feedback from actual users

Tips for Success

  1. Name evaluations clearly - Include date and configuration details
  2. Use templates - Ensures consistent test sets across comparisons
  3. Document decisions - Record why you adopted or rejected each change
  4. Test incrementally - Small changes are easier to validate
  5. Consider trade-offs - Higher accuracy may mean higher latency