v2.5.3
February 19, 2026
Cleaner PDF text extraction reduces post-processing
PDF extraction now includes a sanitize_content option (default: True) that normalizes fragmented text by collapsing excessive whitespace. With cleaner default output, you'll spend less time cleaning documents and get more reliable retrieval and analysis downstream.
Details
- New parameter: sanitize_content=True by default
- Disable when exact layout/spacing must be preserved for specialized parsing
- No breaking changes; review behavior if you depend on raw spacing in existing pipelines
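
To get a feel for what whitespace sanitization does to extracted text, here is a minimal standalone sketch. It is not the library's implementation: the function name collapse_whitespace and the exact rules (single spaces within a line, at most one blank line between paragraphs) are assumptions chosen for illustration, and the actual behavior of sanitize_content may differ.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical illustration of whitespace normalization.

    Not the library's code; the rules here are assumptions.
    """
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the single space left on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Fragmented text of the kind PDF extraction can produce.
raw = "Quarterly   revenue \n\n\n\n grew  12%\t year over year."
cleaned = collapse_whitespace(raw)
print(repr(cleaned))
```

If an existing pipeline relies on column positions or indentation in the raw text, a comparison like this between raw and cleaned output on a few sample documents can help decide whether to disable the option.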
