
How documents are chunked, embedded, and indexed with BM25.
Hybrid RAG indexing builds two parallel indexes: a vector embedding index for semantic search and a BM25 inverted index for keyword matching. This dual-index approach enables hybrid retrieval at query time.
Extracts raw text from uploaded files (PDF, TXT, MD, CSV, DOCX, XLS).
Same preprocessing options as Simple RAG, applied before both indexing paths.
Chunks are created once and fed to both the embedding model and BM25 indexer.
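The shared-chunk idea above can be sketched as follows. This is a minimal illustration, not the product's actual chunker: the fixed-size character window, the `chunk_size` and `overlap` parameters, and the `chunk_text` and `build_indexes` names are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows (hypothetical strategy)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_indexes(
    chunks: Sequence[str],
    embed: Callable[[str], List[float]],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[List[float]], List[List[str]]]:
    """Feed the same chunk list to both indexing paths.

    Position i in both outputs refers to chunk i, so results from the
    vector side and the keyword side can later be merged by chunk id.
    """
    vectors = [embed(chunk) for chunk in chunks]        # semantic path
    token_lists = [tokenize(chunk) for chunk in chunks]  # keyword path
    return vectors, token_lists
```

Because both outputs are aligned by list position, a hybrid retriever can combine scores per chunk id without any extra mapping between the two indexes.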
Converts each chunk into a dense vector for semantic search.
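A rough sketch of the embedding path is shown below. The real pipeline would call an embedding model here; `toy_embed` is a hypothetical stand-in (a hashed bag-of-words vector with no semantic knowledge) so the example runs without external dependencies, and `VectorIndex` is an illustrative name rather than an actual class in the product.

```python
import hashlib
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class VectorIndex:
    """Brute-force vector index: one dense vector per chunk."""

    def __init__(self, chunks: Sequence[str], embed: Callable[[str], Vector]):
        self.embed = embed
        self.vectors = [embed(chunk) for chunk in chunks]

    def search(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        """Return (chunk_id, similarity) pairs, best first."""
        q = self.embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (chunk_id, sum(a * b for a, b in zip(q, vec)))
            for chunk_id, vec in enumerate(self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

Swapping `toy_embed` for a call to a real model keeps the rest of the sketch unchanged, since the index only depends on the `embed` callable returning a fixed-length vector.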
Builds an inverted index with BM25 term weighting (a TF-IDF-style scheme that adds term-frequency saturation and document-length normalization) for exact keyword matching.
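For illustration, a small inverted index with BM25 scoring might look like the sketch below. The tokenizer, the `BM25Index` class name, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for the example (they are common defaults, not values confirmed for this pipeline).

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Inverted index: term -> {chunk_id: term frequency}, scored with BM25."""

    def __init__(self, chunks: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.postings: Dict[str, Dict[int, int]] = defaultdict(dict)
        self.doc_len: List[int] = []
        for chunk_id, chunk in enumerate(chunks):
            tokens = tokenize(chunk)
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][chunk_id] = tf
        self.n_docs = len(chunks)
        self.avg_len = sum(self.doc_len) / self.n_docs if self.n_docs else 0.0

    def score(self, query: str) -> Dict[int, float]:
        """Return BM25 scores for every chunk that shares a term with the query."""
        scores: Dict[int, float] = defaultdict(float)
        for term in tokenize(query):
            posting = self.postings.get(term)
            if not posting:
                continue
            df = len(posting)
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))
            for chunk_id, tf in posting.items():
                # Length normalization: longer chunks need more hits to score equally.
                norm = 1 - self.b + self.b * self.doc_len[chunk_id] / self.avg_len
                scores[chunk_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return dict(scores)
```

Only chunks listed in a query term's postings are touched at query time, which is what makes the inverted index efficient for exact keyword lookups.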