
Hybrid RAG — Indexing Pipeline

How documents are chunked, embedded, and indexed with BM25.

Overview

Hybrid RAG indexing builds two parallel indexes: a vector embedding index for semantic search and a BM25 inverted index for keyword matching. This dual-index approach enables hybrid retrieval at query time.

Pipeline Steps

1. Document Parser

Extracts raw text from uploaded files (PDF, TXT, MD, CSV, DOCX, XLS).

2. Text Preprocessing

Applies the same preprocessing options as Simple RAG. Preprocessing runs once, before the pipeline splits into the two indexing paths.

3. Chunking Strategy

Chunks are created once and fed to both the embedding model and BM25 indexer.
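As an illustration of "chunk once, index twice", here is a minimal sketch that assumes fixed-size character chunks with overlap. The function name `chunk_text` and the `chunk_size` and `overlap` parameters are hypothetical; the actual chunking strategy and its settings are not specified on this page.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap.

    Hypothetical helper: the real pipeline may chunk by sentence,
    token count, or document structure instead.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# The same list is handed to both indexing paths, so chunk IDs
# (list positions here) line up across the vector and BM25 indexes.
chunks = chunk_text("Hybrid retrieval combines dense and sparse signals.", 20, 5)
```

Because both indexes are built from the same list, a chunk ID returned by either index refers to the same piece of text.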

4. Embedding Model (Vector Index)

Converts each chunk into a dense vector for semantic search.
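The shape of the vector indexing path can be sketched as follows. This is only a sketch: `toy_embed` is a hypothetical stand-in (a hashed bag of words) used so the example runs without external dependencies. It does not capture semantic similarity the way a real embedding model does, and the in-memory list stands in for whatever vector database the pipeline uses.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Placeholder for a real embedding model (assumption, not the real one).

    Hashes each token into one of `dim` buckets and L2-normalizes the result.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_vector_index(chunks: list[str]) -> list[tuple[int, list[float]]]:
    """Pair each chunk ID with its vector; a vector DB would store these."""
    return [(chunk_id, toy_embed(chunk)) for chunk_id, chunk in enumerate(chunks)]


vector_index = build_vector_index(["hybrid search", "keyword matching"])
```

In a real deployment, `toy_embed` would be replaced by a call to the configured embedding model, and the resulting vectors would be written to the vector store keyed by chunk ID.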

5. BM25 Index Builder (Keyword Index)

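Builds an inverted index over the same chunks for exact keyword matching. BM25 uses TF-IDF-style weighting: inverse document frequency combined with a term frequency that saturates and is normalized by chunk length.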
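A minimal sketch of what such an index might hold, assuming a simple lowercase alphanumeric tokenizer and the common Okapi BM25 formula with a Lucene-style IDF. The function names, the dictionary layout, and the defaults `k1=1.5` and `b=0.75` are illustrative assumptions, not the pipeline's actual implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Assumed tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_bm25_index(chunks: list[str]) -> dict:
    """Build postings (term -> {chunk_id: tf}) plus the stats BM25 needs."""
    postings: dict[str, dict[int, int]] = defaultdict(dict)
    lengths = []
    for chunk_id, chunk in enumerate(chunks):
        tokens = tokenize(chunk)
        lengths.append(len(tokens))
        for term, tf in Counter(tokens).items():
            postings[term][chunk_id] = tf
    n = len(chunks)
    avgdl = sum(lengths) / n if n else 0.0
    # IDF variant that stays positive even for very common terms.
    idf = {
        term: math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
        for term, docs in postings.items()
    }
    return {"postings": dict(postings), "lengths": lengths, "avgdl": avgdl, "idf": idf}


def bm25_scores(index: dict, query: str, k1: float = 1.5, b: float = 0.75) -> dict[int, float]:
    """Score only the chunks that contain at least one query term."""
    scores: dict[int, float] = defaultdict(float)
    for term in tokenize(query):
        for chunk_id, tf in index["postings"].get(term, {}).items():
            length_norm = 1 - b + b * index["lengths"][chunk_id] / index["avgdl"]
            scores[chunk_id] += index["idf"][term] * tf * (k1 + 1) / (tf + k1 * length_norm)
    return dict(scores)


index = build_bm25_index(["the cat sat", "the dog barked loudly", "cat and cat"])
```

Everything that depends only on the corpus (postings, chunk lengths, average length, IDF) is computed at indexing time. The scoring function is shown only to make clear how those stored statistics are used later.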

Key Differences from Simple RAG

BM25 Index Builder runs in parallel with Embedding Model
Two storage targets: Vector DB + BM25 inverted index
TF-IDF-style statistics (term frequencies, document frequencies, chunk lengths) computed during BM25 index construction
Slightly longer indexing time due to dual index building
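
The parallel arrangement listed above can be sketched with two worker threads that receive the same chunk list. This is a sketch under assumptions: `build_indexes` and the two callables passed to it are hypothetical placeholders for the embedding step and the BM25 index builder.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def build_indexes(
    chunks: Sequence[str],
    build_vector_index: Callable[[Sequence[str]], object],
    build_keyword_index: Callable[[Sequence[str]], object],
) -> tuple[object, object]:
    """Run both index builders concurrently over the same chunks."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vector_future = pool.submit(build_vector_index, chunks)
        keyword_future = pool.submit(build_keyword_index, chunks)
        # result() re-raises any exception from the worker, so a failure
        # in either path surfaces here instead of being silently lost.
        return vector_future.result(), keyword_future.result()


# Trivial stand-in builders, only to show the call shape.
vector_index, keyword_index = build_indexes(
    ["alpha beta", "gamma"],
    lambda cs: [len(c) for c in cs],
    lambda cs: {c: i for i, c in enumerate(cs)},
)
```

Threads help most when the embedding step waits on a remote API or on native code that releases the interpreter lock. For pure-Python builders, a process pool would be the closer equivalent.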
