v2.4.6
January 28, 2026
Reliable website deduplication with per-page content hashing
WebsiteReader now computes a distinct content hash for each crawled URL, fixing skip_if_exists behavior for multi-page crawls. Each page is deduplicated individually, which reduces redundant ingestion and cuts processing cost during re-crawls.
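Conceptually, per-page hashing works like the minimal sketch below. This is an illustration of the technique, not the WebsiteReader implementation: the page_hash and ingest names, and the in-memory dict standing in for a knowledge store, are all assumptions made for the example.

```python
import hashlib

def page_hash(url: str, content: str) -> str:
    # Hash the URL together with the page content so every crawled
    # page gets its own identity, even within a single crawl.
    return hashlib.sha256(f"{url}\n{content}".encode("utf-8")).hexdigest()

def ingest(pages: dict[str, str], store: dict[str, str],
           skip_if_exists: bool = True) -> list[str]:
    """Insert pages into a toy key-value store, skipping unchanged ones."""
    ingested = []
    for url, content in pages.items():
        digest = page_hash(url, content)
        if skip_if_exists and digest in store:
            continue  # identical page already present: skip re-ingestion
        store[digest] = content
        ingested.append(url)
    return ingested
```

With one hash per URL, a re-crawl only writes pages whose content actually changed; a single crawl-wide hash could not make that distinction.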
Details
- Correct per-page deduplication for predictable skip_if_exists behavior
- Fewer unnecessary writes and lower token usage when re-indexing multi-page sites
- Action required: Clear existing website crawl entries in your knowledge store before re-indexing to avoid duplicates (see the hypothetical cleanup sketch after this list)
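One way to perform the one-time cleanup, sketched against an assumed SQLite-backed store. The knowledge.db path, documents table, and source_type column are hypothetical; adapt the query to your store's actual schema, or use its own delete/recreate API if it provides one.

```python
import sqlite3

# Hypothetical cleanup: table and column names below are illustrative,
# not the actual schema of any specific knowledge store.
conn = sqlite3.connect("knowledge.db")
with conn:
    conn.execute(
        "DELETE FROM documents WHERE source_type = ?",
        ("website",),
    )
conn.close()
```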
Who this is for: Teams maintaining search indexes, documentation portals, or knowledge bases sourced from websites.
