
Configure chunking strategy and text processing in Step 3 of the wizard.
Chunking divides your documents into smaller pieces that can be embedded and searched efficiently. The chunking strategy you choose directly impacts retrieval quality.
| Method | Best For | Chunk Size | Overlap |
|---|---|---|---|
| Fixed-Size | Uniform documents, manuals | 256-2048 tokens | 10-20% |
| Recursive | Mixed content, technical docs | 256-2048 tokens | 10-20% |
| Semantic | Articles, blog posts | Variable | N/A |
| Document | Short docs, FAQ entries | Full document | N/A |
Splits text into equal-sized chunks with configurable overlap. Simple and predictable.
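To make the windowing concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration, not the product's implementation: the function name `fixed_size_chunks` is hypothetical, the parameters simply mirror the `chunk_size` and `chunk_overlap` settings, and whitespace-separated words stand in for tokens.

```python
def fixed_size_chunks(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Assumption: whitespace-separated words stand in for real tokens.
words = ("word " * 1200).split()
chunks = fixed_size_chunks(words, chunk_size=512, chunk_overlap=50)
print([len(c) for c in chunks])  # prints [512, 512, 276]
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the final chunk is usually shorter than the rest.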
Config example:

    {
      "chunking": {
        "method": "fixed",
        "chunk_size": 512,
        "chunk_overlap": 50,
        "chunk_min_size": 128,
        "chunk_max_size": 768
      }
    }

Best for: uniform documents and manuals.
Splits text by separators (paragraphs → sentences → words) to maintain semantic boundaries.
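The separator cascade can be sketched as follows. This is a simplified illustration under stated assumptions: `recursive_split` is a hypothetical name, size is measured in characters rather than tokens, overlap is left out, and the separator itself is dropped at each split point.

```python
def recursive_split(text, chunk_size=1024, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph.\n\n"
    "First sentence of a long paragraph. Second sentence here. Third sentence.\n\n"
    "End."
)
result = recursive_split(text, chunk_size=40)
for chunk in result:
    print(repr(chunk))
```

Paragraph breaks are tried first, so the long middle paragraph is the only part that gets split at sentence level.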
Config example:

    {
      "chunking": {
        "method": "recursive",
        "chunk_size": 1024,
        "chunk_overlap": 100,
        "separators": ["\n\n", "\n", ". ", " "]
      }
    }

Best for: mixed content and technical docs.
Uses embedding similarity to detect topic boundaries and split where meaning changes.
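One common way to detect topic boundaries is to compare the embeddings of adjacent sentences and start a new chunk when their cosine similarity drops. The sketch below assumes that approach. The function names, the 0.5 threshold, and the hand-written two-dimensional vectors are all illustrative, and a real pipeline would get its vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([])  # similarity dropped: treat as a topic boundary
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]


sentences = [
    "Cats purr.",
    "Cats nap a lot.",
    "Rust has a borrow checker.",
    "Rust compiles to native code.",
]
# Toy vectors (assumption): the first axis means "about cats", the second "about Rust".
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
semantic_result = semantic_chunks(sentences, embeddings)
print(semantic_result)
```

Because boundaries follow the content, chunk sizes vary, which matches the "Variable" entry in the methods table.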
Best for: articles and blog posts.
Treats each document (or major section) as a single chunk.
Best for: short docs and FAQ entries.
| Option | Description | Default |
|---|---|---|
| Remove extra whitespace | Collapse multiple spaces/newlines | Enabled |
| Normalize Unicode | Convert special characters to standard | Enabled |
| Strip HTML tags | Remove HTML markup from content | Enabled |
| OCR for scanned PDFs | Extract text from images in PDFs | Disabled |
Enable OCR only for scanned PDFs. OCR adds significant processing time and is unnecessary for text-extractable PDFs.
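As a rough picture of what the three default-enabled options do, the sketch below applies them in sequence. It is an assumption-laden illustration: `clean_text` and its flags are hypothetical names, NFKC is one plausible choice of Unicode normalization, and the regex tag stripper is far cruder than a real HTML parser. OCR is left out.

```python
import re
import unicodedata


def clean_text(raw, strip_html=True, normalize_unicode=True, collapse_whitespace=True):
    """Apply HTML stripping, Unicode normalization, and whitespace collapsing."""
    text = raw
    if strip_html:
        text = re.sub(r"<[^>]+>", " ", text)  # crude: replaces anything tag-shaped
    if normalize_unicode:
        text = unicodedata.normalize("NFKC", text)  # e.g. ligatures become plain letters
    if collapse_whitespace:
        text = re.sub(r"[ \t]+", " ", text)           # runs of spaces/tabs -> one space
        text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
        text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
        text = text.strip()
    return text


sample = "<p>Caf\u00e9  \ufb01le</p>\n\n\n<p>Next</p>"
cleaned = clean_text(sample)
print(repr(cleaned))
```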
| Filter | Description |
|---|---|
| Skip empty chunks | Remove chunks with no meaningful content |
| Skip boilerplate | Filter out headers, footers, page numbers |
| Minimum chunk size | Drop chunks below token threshold |
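The three filters could be combined along these lines. This is only a sketch: `filter_chunks` is a hypothetical name, the boilerplate check is limited to a simple page-number pattern, words stand in for tokens, and the default threshold of 128 is borrowed from the `chunk_min_size` value in the fixed-size config example.

```python
import re

# Assumption: "boilerplate" here means lines that are only a page number,
# such as "12", "Page 3", or "Page 3 of 10".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def filter_chunks(chunks, min_tokens=128):
    """Drop empty chunks, strip page-number lines, and enforce a minimum size."""
    kept = []
    for chunk in chunks:
        if not chunk.strip():
            continue  # skip empty chunks
        lines = [line for line in chunk.splitlines() if not PAGE_NUMBER.match(line)]
        body = "\n".join(lines).strip()
        if len(body.split()) < min_tokens:
            continue  # below the minimum chunk size
        kept.append(body)
    return kept


raw_chunks = ["", "   ", "Page 3 of 10\nThe quick brown fox jumps.", "Too short", "12"]
filtered = filter_chunks(raw_chunks, min_tokens=3)
print(filtered)  # prints ['The quick brown fox jumps.']
```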
After configuring chunking, use the Chunk Preview tool to see how your documents will be split.
The preview shows:
Use the Similarity Score Test tool to verify that your chunking produces good retrieval.
Good scores:
- 0.85+: Excellent match
- 0.70-0.84: Good match
- 0.50-0.69: Acceptable
- Below 0.50: Review chunking or content quality

Low similarity scores usually mean chunks are too large (diluting meaning) or too small (losing context). Try adjusting chunk_size and chunk_overlap.
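If you collect scores from several test queries, a small helper can label each one with the bands above. The function name `rate_similarity` is hypothetical, and treating each band's lower bound as inclusive is an assumption, since the listed ranges leave values such as 0.845 unassigned.

```python
def rate_similarity(score):
    """Map a similarity score to the rating bands from the score guide."""
    if score >= 0.85:
        return "Excellent match"
    if score >= 0.70:
        return "Good match"
    if score >= 0.50:
        return "Acceptable"
    return "Review chunking or content quality"


for score in (0.91, 0.72, 0.55, 0.31):
    print(f"{score:.2f}: {rate_similarity(score)}")
```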
| Document Type | Method | Chunk Size (tokens) | Overlap (tokens) |
|---|---|---|---|
| Technical Manual | Recursive | 512-1024 | 50-100 |
| Legal Contract | Fixed | 256-512 | 25-50 |
| Blog Post | Semantic | Variable | N/A |
| FAQ Entry | Document | Full doc | N/A |
| Research Paper | Recursive | 1024-2048 | 100-200 |
| CSV Data | Fixed | 128-256 | 0 |
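The table can be turned into a lookup that produces a starting config per document type. The sketch below is one possible way to do that, with several assumptions: the dictionary keys are hypothetical, the "semantic" and "document" method strings are guesses (only "fixed" and "recursive" appear in the config examples), and picking the low end of each range is just a conservative starting point.

```python
import json

# Ranges copied from the recommendations table. The keys and the "semantic" /
# "document" method strings are hypothetical names chosen for this sketch.
RECOMMENDED = {
    "technical_manual": ("recursive", (512, 1024), (50, 100)),
    "legal_contract": ("fixed", (256, 512), (25, 50)),
    "blog_post": ("semantic", None, None),
    "faq_entry": ("document", None, None),
    "research_paper": ("recursive", (1024, 2048), (100, 200)),
    "csv_data": ("fixed", (128, 256), (0, 0)),
}


def starting_config(doc_type):
    """Build a chunking config using the low end of each recommended range."""
    method, size_range, overlap_range = RECOMMENDED[doc_type]
    chunking = {"method": method}
    if size_range is not None:
        chunking["chunk_size"] = size_range[0]
        chunking["chunk_overlap"] = overlap_range[0]
    return {"chunking": chunking}


print(json.dumps(starting_config("research_paper"), indent=2))
```

From there, adjust the values within the recommended range and check the effect with the preview and similarity tools.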
Pass chunking config when uploading:

    curl -X POST "https://api.guidedmind.ai/rag/upload-and-process" \
      -H "X-API-Key: rk_your_key_here" \
      -F "file=@document.pdf" \
      -F 'config={
        "chunking": {
          "method": "recursive",
          "chunk_size": 512,
          "chunk_overlap": 50
        }
      }'

After configuring chunking, move to Pipeline Configuration to select your embedding model and retrieval method.