
How documents are chunked and embedded for vector search.

Simple RAG indexing converts raw documents into searchable vector embeddings through a linear pipeline: parse → preprocess → chunk → embed → store.
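
The parse step extracts raw text from uploaded files (PDF, TXT, MD, CSV, DOCX, XLS).

The preprocess step then applies optional text cleanup. Every setting is off by default:
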
| Setting | Effect | Default |
|---|---|---|
| Remove Non-ASCII | Strips characters outside the ASCII range | false |
| Lowercase | Converts all text to lowercase | false |
| Collapse Spaces | Collapses runs of extra whitespace | false |
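
The three settings above can be sketched as a small function. This is a minimal illustration, not the product's implementation: the `PreprocessSettings` class, its field names, and the order in which the steps run are assumptions, as is the choice to treat newlines as whitespace when collapsing.

```python
import re
from dataclasses import dataclass


@dataclass
class PreprocessSettings:
    # Hypothetical field names mirroring the table; all default to False.
    remove_non_ascii: bool = False
    lowercase: bool = False
    collapse_spaces: bool = False


def preprocess(text: str, settings: PreprocessSettings) -> str:
    if settings.remove_non_ascii:
        # Drop every character that cannot be encoded as ASCII.
        text = text.encode("ascii", errors="ignore").decode("ascii")
    if settings.lowercase:
        text = text.lower()
    if settings.collapse_spaces:
        # Assumption: any whitespace run (including newlines) becomes one space.
        text = re.sub(r"\s+", " ", text).strip()
    return text
```

With every setting left at its default, the text passes through unchanged, which matches the `false` defaults in the table.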

The chunk step splits the text into chunks using one of three strategies:

| Strategy | When to Use |
|---|---|
| Fixed Size | General purpose, predictable chunks |
| Recursive | Preserves paragraph structure |
| Semantic | LLM-determined natural boundaries |
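
To make the first two strategies concrete, here is a rough sketch of each. The function names, the character-based sizes, the `overlap` parameter, and the separator order are all assumptions for illustration; the actual chunker may count tokens instead of characters and behave differently at boundaries. The Semantic strategy depends on an LLM call and is not sketched here.

```python
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 20) -> list:
    """Cut text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def recursive_chunks(text: str, max_size: int = 200,
                     separators: tuple = ("\n\n", "\n", " ")) -> list:
    """Split on the coarsest separator first, falling back to finer ones."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to hard cuts.
        return fixed_size_chunks(text, size=max_size, overlap=0)
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= max_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_chunks(piece, max_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

The fixed-size version produces predictable chunk lengths but may cut mid-sentence, while the recursive version keeps paragraphs together whenever they fit within `max_size`.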

The embed step converts each chunk into a dense vector. The embedding model is selected in the wizard (e.g., `text-embedding-3-small`).

The store step saves each embedding together with its chunk metadata so that similarity search is fast at query time.
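
As an illustration of what storing embeddings with metadata enables, the sketch below keeps vectors in memory and ranks them by cosine similarity. Everything here is hypothetical: the `InMemoryVectorStore` class, the metadata keys, and the two-dimensional toy vectors, which stand in for real embeddings. It also scans every record, whereas production vector stores typically use an approximate nearest-neighbor index.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    vector: list
    text: str
    metadata: dict  # e.g. source file and chunk position (assumed keys)


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    def __init__(self):
        self.records = []

    def add(self, vector: list, text: str, metadata: dict) -> None:
        self.records.append(Record(vector, text, metadata))

    def search(self, query_vector: list, k: int = 3) -> list:
        # Brute-force scan: score every record, then return the top k.
        scored = [(cosine_similarity(query_vector, r.vector), r) for r in self.records]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add([1.0, 0.0], "Refund policy", {"source": "faq.md", "chunk_index": 0})
store.add([0.0, 1.0], "Shipping times", {"source": "faq.md", "chunk_index": 1})
best_score, best_record = store.search([0.9, 0.1], k=1)[0]
print(best_record.text, best_record.metadata)
```

Keeping the metadata next to each vector is what lets a search result be traced back to its source document and position.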