v2.2.2

October 29, 2025

Cut inference latency and cost with built-in LLM response caching

We’ve added native caching for model responses across sync, async, and streaming APIs. You can configure the TTL and cache storage to accelerate repeated prompts and cut token spend without building your own cache layer. Caching is opt-in; once enabled, it works across the platform with no extra infrastructure.

Details

  • Works across all API modes (sync/async/streaming)
  • Configurable TTL and cache directory for control and portability (see the sketch after this list)
  • Reduces latency for repeated prompts in production and evaluation pipelines
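This entry doesn’t show the configuration surface, so the sketch below is illustrative only: the module name `llm_sdk`, the `Client` and `CacheConfig` objects, and the `ttl_seconds` / `cache_dir` parameters are placeholders for how the configurable TTL and cache directory described above might be wired up, not the product’s actual API.

```python
# Illustrative sketch -- "llm_sdk", "Client", "CacheConfig", and the parameter
# names below are placeholders, not the actual API surface.
from llm_sdk import Client, CacheConfig

client = Client(
    cache=CacheConfig(
        enabled=True,                          # caching is opt-in
        ttl_seconds=3600,                      # configurable TTL: reuse responses for up to an hour
        cache_dir="/var/cache/llm-responses",  # configurable storage location for portability
    ),
)

# Sync call: an identical prompt within the TTL is served from the cache.
first = client.generate(model="example-model", prompt="Summarize our Q3 results.")
second = client.generate(model="example-model", prompt="Summarize our Q3 results.")  # cache hit

# The same cache applies to async and streaming calls.
async def run_async():
    return await client.agenerate(model="example-model", prompt="Summarize our Q3 results.")

for chunk in client.stream(model="example-model", prompt="Summarize our Q3 results."):
    print(chunk, end="")
```

Keeping the TTL and cache directory in a single config object also lets evaluation pipelines point at a shared, portable cache.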

Who this is for: Teams optimizing cost, responsiveness, and throughput for high-volume or repetitive LLM workloads.