Like many great open-source features, this one came straight from the community.
LLM Response Caching seems simple on the surface: cache model responses to avoid repeated API calls. But that simplicity hides a real pain point every LLM developer knows well: the waiting game. Wait for the API. Wait for the response. Wait again when you test the same query for the tenth time.
For teams building complex agent systems, this was more than an inconvenience. It slowed development cycles, burned through API budgets during testing, and made the iterative process of prompt engineering feel like watching paint dry.
Where the story begins: A pattern emerges
Feedback started coming in through our Discord community in late August 2025. And it wasn’t just scattered comments; it was a clear pattern.
“I’d like to know if Agno offers an LLM cache feature similar to LangChain's set_llm_cache. This would be really helpful for me to save costs” — @richardwang, August 26, 2025
He wasn't alone. @Joao Graca had been wrestling with tool-level caching, expecting it to prevent repeated API calls for similar queries. But tool caching only works when inputs match exactly, which rarely happens when LLMs generate free-text queries. What developers really needed was caching at the LLM response level.
Then there was @Alex88, who discovered an old pull request for response caching that had been gathering dust. The feature had been attempted before but stalled. With Agno v2.0 on the horizon, the need became more urgent: "It would be incredibly helpful for debugging during the migration."
What we heard from our users
The impact showed up across multiple use cases:
- Slower development cycles: Developers were waiting seconds, sometimes tens of seconds, for responses they had already seen, especially when working with complex multi-agent systems.
- Rising costs: Every iteration, every test run, every debugging session triggered new API calls, even for identical queries.
- Frustrating testing workflows: Teams couldn't ensure consistent responses across test runs, making it harder to validate agent behavior.
- Rate limit headaches: Developers were repeatedly hitting provider rate limits during active development, slowing progress even more.
One user summed it up perfectly: "Built-in LLM caching works well enough in production. But a feature like LangChain's set_llm_cache is very helpful during development. It helps save costs and, more importantly, makes development faster, especially for complex agent systems, because it reduces the waiting time from repeated calls."
Plan A: What we thought the feature needed to be
When we responded to richardwang's request with "Agreed. We'll get it done! It is a nice feature!" on August 28th, we knew we had to move fast. But we also knew we had to get it right.
Our design goals were straightforward:
We wanted Response Caching to be automatic and invisible, with no configuration needed.
From the start, we made several key decisions:
- Development-first focus: Unlike provider-specific caching mechanisms (like Anthropic's prompt caching or OpenAI's built-in caching), this feature needed to serve developers during the build phase. We explicitly documented: "Do not use response caching in production for dynamic content or when you need fresh, up-to-date responses."
- Persistence across sessions: The cache couldn't just live in memory. Developers restart their applications constantly, so we needed disk-based caching that survives restarts and persists across development sessions.
- Model-level configuration: Rather than bolting caching onto agents or teams, we knew it belonged at the model level. This way, it would work automatically across all use cases, including single-agent and multi-agent teams, as well as streaming responses.
- Smart cache key generation: We kept it simple and effective: the cache key is generated from the messages sent to the agent, so sending the same question or message again triggers a cache hit. Developers get exactly what they expect: repeated queries return cached responses instantly (see the sketch after this list).
- Zero configuration: We deliberately avoided adding configuration options. No cache TTL to set, no cache directory to specify, no knobs to turn. Just one boolean flag: cache_response=True. The feature should work perfectly out of the box.
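To make the key-generation idea concrete, here is a minimal sketch of what a message-based cache key could look like. It is illustrative only, not Agno's actual implementation; the function name and serialization choices are assumptions.

import hashlib
import json

def cache_key_for(messages: list[dict]) -> str:
    # Hypothetical helper: serialize the message list deterministically
    # and hash it, so identical conversations map to the same key.
    payload = json.dumps(messages, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# The same messages always produce the same key, which is what turns
# a repeated query into a cache hit.
messages = [{"role": "user", "content": "What is the capital of France?"}]
assert cache_key_for(messages) == cache_key_for(messages)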
The tradeoffs were clear from the beginning. We prioritized developer experience over production optimization, local disk caching over distributed systems, and simplicity over configurability. We wanted a feature you could enable in five seconds and never think about again.
Plan B: Refinement through collaboration
The story of Agno’s Response Caching shows the power of community-driven development. Although the feature was implemented by our core team, it was shaped heavily by the people in our Discord community who were searching for “LLM cache,” “model cache,” and “caching responses.” Their needs and conversations defined the direction.
The community identified the pattern. It wasn’t a single request. It was richardwang on August 26th, Joao Graca on September 3rd, Alex88 on September 24th, and likely many others in between. Each conversation brought additional clarity to the problem. Each use case revealed a new angle. Together, they painted a complete picture of what developers actually needed.
We built the foundation. Between August 28th (when we committed to building it) and October 29th (when we shipped it in v2.2.2), our team focused on getting the core mechanism right: generating cache keys from messages, storing responses on disk, and handling cache hits and misses transparently.
The community stress-tested the vision. The real-world testing in our Discord channels shaped key decisions. When Joao Graca tried tool-level caching and found it didn't work for his use case, we learned we needed LLM-level caching. When richardwang distinguished between production needs (provider-specific caching) and development needs (local response caching), it helped us sharpen our focus.
Together, we made it better. The conversations didn't stop at "we need caching." They evolved into "we need caching that works during development," "we need caching that persists across restarts," and “we need caching that doesn’t interfere with testing.” Every piece of feedback helped refine what we built.
The Payoff: A perfect fix for a real pain
LLM Response Caching in Agno is simple to enable. Just add the cache_response=True parameter when initializing your model:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
agent = Agent(
    model=OpenAIChat(
        id="gpt-4o",
        cache_response=True,  # That's it
    )
)
# First call - cache miss, calls the API
response = agent.run("What is the capital of France?")
# Second identical call - cache hit, returns instantly
response = agent.run("What is the capital of France?")
That's the entire API. No configuration files, no cache policies, no setup. Just flip the switch and go.
How it works
- Cache key generation: When you make a request, Agno generates a unique key based on the messages sent to your agent
- Cache lookup: Before making an API call, Agno checks whether that key already exists in the local cache
- Cache hit: If the key exists, the response returns immediately with no API call, no waiting, and no cost
- Cache miss: If it doesn’t exist, Agno calls the API and caches the response for next time
The cache lives on disk, persisting across sessions and program restarts. It just works.
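Put together, the lookup path behaves roughly like the sketch below. This is a conceptual illustration under assumed names (cached_call, the cache directory), not Agno's internal code.

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_response_cache")  # assumed location, for illustration only

def cached_call(messages: list[dict], call_api) -> dict:
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()
    entry = CACHE_DIR / f"{key}.json"
    if entry.exists():
        # Cache hit: return immediately, with no API call and no cost.
        return json.loads(entry.read_text())
    # Cache miss: call the provider, then persist the response to disk
    # so it survives restarts and later development sessions.
    response = call_api(messages)
    entry.write_text(json.dumps(response))
    return response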
It works everywhere
Because caching is configured at the model level, it works automatically with:
Single agents: Cache individual agent responses
agent = Agent(
    model=OpenAIChat(id="gpt-4o", cache_response=True)
)
agent.run("Your query") # All runs use caching
Multi-agent teams: Each team member maintains their own cache
from agno.team import Team
researcher = Agent(
    model=OpenAIChat(id="gpt-4o", cache_response=True),
    name="Researcher",
)
writer = Agent(
    model=OpenAIChat(id="gpt-4o", cache_response=True),
    name="Writer",
)
team = Team(
    members=[researcher, writer],
    model=OpenAIChat(id="gpt-4o", cache_response=True),
)
Streaming responses: Cache hits return full responses as single chunks
agent = Agent(
    model=OpenAIChat(id="gpt-4o", cache_response=True)
)
# First run streams from API
agent.print_response("Tell me a story", stream=True)
# Second run returns cached response instantly
agent.print_response("Tell me a story", stream=True)
The integration with Agno's broader ecosystem is seamless. Response Caching doesn't interfere with tool calling, memory, structured outputs, or any other Agno feature. It's truly invisible until you need it.
What we learned along the way
Building Response Caching taught us something fundamental about open-source development: the best features emerge from real pain points, not abstract planning.
We didn't wake up one day and decide Agno needed response caching. Our community told us. They showed us the problem through Discord messages, GitHub issues, and candid feedback about their development workflows. richardwang's message wasn't a feature request — it was a description of friction he experienced every single day.
The development vs. production distinction mattered more than we expected. We initially thought of caching as a universal optimization. But the community helped us see that developer experience during the build phase is fundamentally different from production needs. Developers need deterministic, repeatable responses for testing. Production systems require fresh, dynamic responses. By focusing on development, we could make better tradeoffs.
Simple APIs hide complex decisions. The final API (cache_response=True) looks trivial. But underneath, we had to make dozens of decisions about cache key generation (we ultimately chose message-based keys for predictability), storage formats (disk-based for persistence), and edge-case handling (like how streaming responses should behave with caching). The decision to avoid configuration options was deliberate and gave developers one less thing to think about.
Disk-based caching was the right call. Some team members initially pushed for in-memory caching for speed. But developers restart their applications constantly. Persisting the cache across sessions turned out to be far more valuable than microseconds of speed improvement.
Here's what surprised us:
"The best part about building in open source is watching a feature go from a Discord message to production in just two months. Response Caching proves that when you listen to your community, you build exactly what developers need — not what you think they need."
This experience reinforced something important: Agno’s values of collaboration, transparency, and experimentation are not aspirational. They guide how we actually build. Response Caching exists because our community openly shared their struggles, contributors actively tested early ideas, and we willingly iterated based on real feedback.
The results: Real-world impact
The numbers tell the story. When we ran our cookbook examples to demonstrate Response Caching, the results were dramatic:
Run 1: Cache Miss (First Request)
- Elapsed time: 15.206s
Run 2: Cache Hit (Cached Response)
- Elapsed time: 0.008s
That's not a typo. The second run was roughly 1,900x faster, dropping from more than 15 seconds to 8 milliseconds. During active development, when you're iterating on prompts, testing edge cases, or debugging multi-agent interactions, those seconds add up to hours of saved time.
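These numbers come from one cookbook run; yours will vary with the model, prompt, and network. If you want to reproduce the comparison yourself, a simple timing loop like the sketch below, using the same Agent API shown earlier, is enough to see the gap.

import time

from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(model=OpenAIChat(id="gpt-4o", cache_response=True))

# The first iteration is a cache miss (calls the API); the second is a cache hit.
for label in ("first run (miss)", "second run (hit)"):
    start = time.perf_counter()
    agent.run("What is the capital of France?")
    print(f"{label}: {time.perf_counter() - start:.3f}s")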
But speed is only part of the benefit. Every cache hit is also:
- Zero API cost — no tokens consumed and no charges incurred
- Zero rate limit impact — easier to stay under provider limits during intensive testing
- Perfect consistency — identical queries return identical responses for reliable testing
For teams building complex agent systems, Response Caching has transformed the development experience. Instead of waiting for API responses during every test run, developers can iterate rapidly on their agent logic, test edge cases thoroughly, and debug issues without burning through their API budgets.
The feature works exactly as developers expect: turn it on once, and it invisibly accelerates every repeated query throughout your development workflow.
Try it, test it, then tell us what’s next
Building Agno’s Response Caching was a team effort in the truest sense. We want to thank:
- @richardwang for clearly articulating the problem and pushing us to prioritize this
- @Joao Graca for helping us understand why tool-level caching wasn't enough
- @Alex88 for surfacing the old PR and advocating for the feature during v2.0 migration
- Agno’s core team for committing to shipping it quickly
- Everyone in our Discord community who shared their experiences and helped us understand what developers really need
Want to try Response Caching?
- Check out the documentation
- See working examples in our examples gallery
- Start building with Agno using our quickstart guide
Join the conversation:
- Join us on Discord!
- Star us on GitHub and open issues or PRs
- Share your Agno projects with the community
What should we build next?
Response Caching solved one pain point, but we know there are more. What feature would make your Agno development experience better? Join our Discord and let us know!
Agno’s next feature may start with your message. We’re listening.


