v2.5.3

February 19, 2026

Cleaner PDF text extraction reduces post-processing

PDF extraction now includes a sanitize_content option (default: True) that normalizes fragmented text by collapsing excessive whitespace. The cleaner default output means you’ll spend less time cleaning documents and get more reliable retrieval and analysis downstream.

Details

  • New parameter: sanitize_content=True by default
  • Disable when exact layout/spacing must be preserved for specialized parsing
  • No breaking changes; because sanitization is on by default, review existing pipelines that depend on raw spacing
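
To make the effect of the option concrete, here is a minimal sketch of the kind of whitespace normalization described above. It is not the library's implementation: extract_text and collapse_whitespace are hypothetical stand-ins, and the exact rules (collapsing runs of spaces and tabs, limiting consecutive blank lines to one) are assumptions chosen for illustration. Only the sanitize_content name and its default of True come from this release.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical normalization: an assumption, not the library's actual rules."""
    # Collapse runs of spaces and tabs inside each line, then trim the line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    joined = "\n".join(lines)
    # Reduce three or more consecutive newlines to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", joined).strip()


def extract_text(raw: str, sanitize_content: bool = True) -> str:
    """Hypothetical stand-in for a PDF extraction call that already has raw text."""
    return collapse_whitespace(raw) if sanitize_content else raw


raw = "Total   revenue \t  2024\n\n\n\nQ1     1,200"

cleaned = extract_text(raw)                             # default: sanitized
untouched = extract_text(raw, sanitize_content=False)   # raw spacing preserved
```

Passing sanitize_content=False in this sketch returns the input unchanged, which mirrors the case where exact layout or spacing must be preserved for specialized parsing.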