Cleaner PDF text extraction reduces post-processing

PDF extraction now includes a sanitize_content option (default: True) that normalizes fragmented text by collapsing excessive whitespace. With cleaner default output, you’ll spend less time cleaning documents, and downstream retrieval and analysis become more reliable.

Details:

New parameter: sanitize_content=True by default
Set sanitize_content=False when exact layout or spacing must be preserved for specialized parsing
No breaking changes; because the option is on by default, review the output of existing pipelines that depend on raw spacing
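
To make the effect of the option concrete, here is a minimal sketch of what collapsing excessive whitespace could look like. It is an illustration only, not the actual implementation: the functions collapse_whitespace and extract_text are hypothetical stand-ins, and the exact normalization rules (which characters count as whitespace, how blank lines are treated) are assumptions. Only the sanitize_content name and its default of True come from this release.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical normalizer; the real rules may differ."""
    # Collapse runs of spaces, tabs, and non-breaking spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Drop a stray space on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more consecutive line breaks to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_text(raw: str, sanitize_content: bool = True) -> str:
    """Hypothetical stand-in for an extractor that exposes the new option."""
    # With sanitize_content=False the raw spacing is returned untouched.
    return collapse_whitespace(raw) if sanitize_content else raw


raw = "Quarterly   report\t 2024 \n\n\n\nRevenue    grew"
print(repr(extract_text(raw)))
print(repr(extract_text(raw, sanitize_content=False)))
```

Under these assumed rules, the default call returns compact text with a single blank line between paragraphs, while passing sanitize_content=False returns the input unchanged, which is the behavior to choose when a downstream parser relies on column alignment or exact spacing.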