v2.4.6
January 28, 2026
Breaking change: website crawling uses per-page content hashes
We changed the WebsiteReader deduplication model to compute a content hash for each crawled page rather than a single hash for the crawl as a whole. This aligns skip_if_exists with page-level updates, so re-crawls re-index only pages whose content has changed (see the sketch below).
Details
- Behavior change: Deduplication now occurs at page granularity, not at the aggregate crawl level
- Action required: Clear existing website crawl entries before re-indexing to prevent duplicates left over from the old aggregate-level hashes
- Benefits: Higher correctness, predictable re-crawls, and lower operational overhead
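To illustrate the new model, here is a minimal sketch of page-level hashing combined with skip_if_exists-style skipping. The function names (page_hash, upsert_pages) and the in-memory hash set are illustrative assumptions, not the library's actual API.

```python
# Conceptual sketch only: names and storage are assumptions, not the WebsiteReader internals.
import hashlib


def page_hash(url: str, content: str) -> str:
    """Hash a single page's content so changes are detected per page, not per crawl."""
    return hashlib.sha256(f"{url}\n{content}".encode("utf-8")).hexdigest()


def upsert_pages(pages: dict[str, str], existing_hashes: set[str]) -> list[str]:
    """Index only pages whose content hash is not already stored (skip_if_exists behavior)."""
    indexed = []
    for url, content in pages.items():
        digest = page_hash(url, content)
        if digest in existing_hashes:
            continue  # unchanged page: skipped on re-crawl
        existing_hashes.add(digest)
        indexed.append(url)
    return indexed


if __name__ == "__main__":
    seen: set[str] = set()
    first = upsert_pages({"https://example.com/a": "v1", "https://example.com/b": "v1"}, seen)
    # Re-crawl after page /a changes: only the changed page is re-indexed.
    second = upsert_pages({"https://example.com/a": "v2", "https://example.com/b": "v1"}, seen)
    print(first)   # ['https://example.com/a', 'https://example.com/b']
    print(second)  # ['https://example.com/a']
```

Because entries indexed under the previous aggregate-level model were keyed differently, they will not match the new per-page hashes; that is why a one-time cleanup before re-indexing is required.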
Who this is for: Engineering teams managing recurring website crawls and large content refreshes.
