Building production-ready AI agents often means discovering issues the hard way. If you’ve built agents, you've probably encountered scenarios where conversations grow longer, tool calls accumulate, and suddenly your agent fails with a "prompt is too long" error. Or worse, your users keep chatting without realizing they're about to hit a wall.
The discovery: context windows have limits, and hitting them mid-conversation makes for a poor user experience.
This guide will show you how to track token usage in Agno and implement best practices to prevent context window overflow.
- Why Token Tracking Matters
- Understanding Agno's Token Metrics
- Common Confusion Points
- Best Practices for Managing Context Windows
- Conclusion
Why Token Tracking Matters
Context window management is more than just a technical detail. It’s critical for three reasons:
Cost Control: Every token costs money. Without tracking, your costs can spiral unpredictably, especially with long-running conversations or agents that make many tool calls.
User Experience: When a conversation exceeds the model's context window, the agent fails with a cryptic error. Proactive monitoring lets you implement graceful fallback strategies, like automatically summarizing old messages or prompting users to start a new conversation, rather than letting agents crash unexpectedly. This is especially critical when users bring their own API keys, where token limits directly affect their usage.
Reliability: Agents that crash mid-task due to exceeding token limits create a poor experience and can lose valuable context. Proactive management prevents these failures.
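To make the "graceful fallback" concrete, here is a minimal sketch. `run_fn` stands in for `agent.run`; since providers word overflow errors differently (and the exception type that surfaces can vary), the string matching below is an assumption, not a guaranteed API.

```python
def run_with_fallback(run_fn, message: str, session_id: str):
    """Call run_fn(message, session_id=...); if the context window
    overflows, return a friendly message instead of crashing."""
    try:
        return run_fn(message, session_id=session_id)
    except Exception as e:
        text = str(e).lower()
        # Providers word overflow errors differently; match loosely (assumption)
        if "too long" in text or "context" in text or "maximum" in text:
            return ("This conversation has grown too long for the model. "
                    "Please start a new session.")
        raise  # unrelated errors still propagate

# Simulated overflowing run, for illustration only:
def fake_run(message, session_id):
    raise RuntimeError("prompt is too long: exceeds the context window")

print(run_with_fallback(fake_run, "hi", session_id="user-1"))
```

The key design choice is that only overflow-shaped errors are swallowed; rate limits, auth failures, and everything else still propagate so you don't mask real bugs.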
Let’s demystify Agno’s token metrics so you can build with more insight.
Understanding Agno's Token Metrics
Agno provides token usage information through two primary mechanisms: run metrics and session metrics. Understanding the difference is crucial.
Run Metrics
Run metrics represent token usage for a single agent execution. Each run provides the following metrics:
- input_tokens: Tokens in the current request (includes system prompt, user message, and conversation history if enabled)
- output_tokens: Tokens generated in the response by the model
- total_tokens: Sum of input and output tokens for this specific run
- cache_write_tokens: Tokens written to cache (for models with prompt caching, like Claude)
- cache_read_tokens: Tokens read from cache on subsequent requests
Here's how to access run metrics:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    instructions="You are a helpful assistant.",
)
response = agent.run("What is the capital of France?")
# Access run metrics
print(f"Input tokens: {response.metrics.input_tokens}")
print(f"Output tokens: {response.metrics.output_tokens}")
print(f"Total tokens: {response.metrics.total_tokens}")
Session Metrics
Session metrics accumulate token usage across an entire conversation. They provide:
- total_tokens: Cumulative sum of all tokens used across all runs in the session
- input_tokens: Total input tokens across all runs
- output_tokens: Total output tokens across all runs
- cache_creation_input_tokens: Total tokens written to cache across the session
- cache_read_tokens: Total tokens read from cache across the session
To access session metrics, you need to enable a database and retrieve the session:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb
# Create agent with database for session tracking
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db"),
    add_history_to_context=True,
    num_history_runs=10,
)
# First interaction
response1 = agent.run("Tell me about Python", session_id="user-123")
print(f"Run 1 tokens: {response1.metrics.total_tokens}")
# Continued conversation
response2 = agent.run("What about its history?", session_id="user-123")
print(f"Run 2 tokens: {response2.metrics.total_tokens}")
# Get session metrics using the agent's method
session_metrics = agent.get_session_metrics(session_id="user-123")
if session_metrics:
    print(f"Session total tokens: {session_metrics.total_tokens}")
Common Confusion Points
Why do metrics seem inconsistent?
Several factors can cause this:
- Cache Tokens: Models with prompt caching (like Claude) report separate metrics for cached tokens. Your input_tokens may appear lower than expected because some tokens are being read from cache (cache_read_tokens) rather than counted as new input.
- History Inclusion: When add_history_to_context=True, input tokens include the previous conversation history.
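A quick way to sanity-check the first point is to add cache reads back onto the reported input and see how much context the model actually consumed. `effective_input_tokens` is a hypothetical helper for illustration, not part of Agno's API.

```python
def effective_input_tokens(input_tokens: int, cache_read_tokens: int = 0) -> int:
    """Tokens the model actually processed as input on this run:
    newly counted input plus tokens replayed from the prompt cache."""
    return input_tokens + cache_read_tokens

# A run that reports only 1,200 new input tokens but read 8,000 from
# cache still consumed 9,200 tokens of context window:
print(effective_input_tokens(1200, 8000))  # 9200
```

This is also the number that matters for context-window headroom: cached tokens still occupy the window even though they are billed differently.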
Best Practices for Managing Context Windows
1. Set Appropriate History Limits
Don't load the entire conversation history every time. Use num_history_runs to control how many previous exchanges are included:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db"),
    add_history_to_context=True,
    num_history_runs=5,  # Only include the last 5 exchanges
)
2. Use Session Summaries for Long Conversations
Let Agno automatically summarize the conversation so far and add that to your context instead of the granular history:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb
from agno.session.summary import SessionSummaryManager
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db"),
    enable_session_summaries=True,
    add_session_summary_to_context=True,
    session_summary_manager=SessionSummaryManager(
        model=OpenAIChat(id="gpt-4o-mini")  # Use a cheaper model for summaries
    ),
    num_history_runs=3,  # Keep summaries + recent history
)
# The agent will automatically generate summaries to compress old context
response = agent.run("Follow up question...", session_id="user-789")
Note: When using session summaries, set a small num_history_runs value since the summary already captures the conversation context.
3. Monitor Token Usage for Multi-Agent Systems
For teams, track token usage at the team level:
from agno.team.team import Team
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb
db = SqliteDb(db_file="agno.db")
team = Team(
    members=[
        Agent(model=OpenAIChat(id="gpt-4o"), db=db),
        Agent(model=OpenAIChat(id="gpt-4o"), db=db),
    ],
    db=db,
)
# Pass session_id when running
response = team.run("Task", session_id="session-123")
# Get team-level session metrics
team_metrics = team.get_session_metrics(session_id="session-123")
if team_metrics:
    print(f"Team total tokens: {team_metrics.total_tokens}")
4. Implement Daily Budget Limits
Protect against runaway costs with custom budget tracking:
from datetime import datetime
from typing import Dict
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb
class DailyBudgetTracker:
    def __init__(self, daily_limit_tokens: int = 1000000):
        self.daily_limit = daily_limit_tokens
        self.usage: Dict[str, int] = {}

    def check_budget(self, session_id: str, estimated_tokens: int) -> bool:
        """Check if we're within the daily budget; raise if not"""
        today = datetime.now().date().isoformat()
        key = f"{session_id}:{today}"
        current_usage = self.usage.get(key, 0)
        if current_usage + estimated_tokens > self.daily_limit:
            raise Exception(
                f"Daily token limit exceeded: {current_usage:,} / {self.daily_limit:,}"
            )
        return True

    def record_usage(self, session_id: str, tokens: int):
        """Record token usage"""
        today = datetime.now().date().isoformat()
        key = f"{session_id}:{today}"
        self.usage[key] = self.usage.get(key, 0) + tokens
# Usage Example
budget_tracker = DailyBudgetTracker(daily_limit_tokens=500000)
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db"),
)
try:
    budget_tracker.check_budget("user-123", estimated_tokens=10000)
    response = agent.run("message", session_id="user-123")
    if response.metrics:
        budget_tracker.record_usage("user-123", response.metrics.total_tokens)
    else:
        print("Warning: No metrics available for budget tracking")
except Exception as e:
    print(f"Budget limit reached: {e}")
5. Enable Prompt Caching for Cost Savings
For models that support it, enable prompt caching to reduce costs:
from agno.agent import Agent
from agno.models.anthropic import Claude
agent = Agent(
    model=Claude(
        id="claude-sonnet-4-20250514",
        cache_system_prompt=True,  # Cache the system prompt
    ),
    instructions="Your large system prompt here...",
)
# First run creates cache
response1 = agent.run("First query")
if response1.metrics:
    print(f"Cache write tokens: {response1.metrics.cache_write_tokens}")
# Second run uses cache
response2 = agent.run("Second query")
if response2.metrics:
    print(f"Cache read tokens: {response2.metrics.cache_read_tokens}")
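Cache reads are typically billed at a fraction of the normal input rate, which is where the savings come from. Here is a back-of-the-envelope cost estimator; the rates below are placeholders, not current pricing (check your provider), and the metric names mirror the fields shown above.

```python
def estimate_cost(metrics: dict, price_per_million: dict) -> float:
    """Estimate a run's cost in dollars from token counts.
    price_per_million maps token kinds to dollars per 1M tokens."""
    return sum(
        metrics.get(kind, 0) / 1_000_000 * rate
        for kind, rate in price_per_million.items()
    )

# Placeholder rates, NOT real pricing -- substitute your provider's numbers:
rates = {"input_tokens": 3.00, "output_tokens": 15.00, "cache_read_tokens": 0.30}
run = {"input_tokens": 2_000, "output_tokens": 500, "cache_read_tokens": 50_000}
print(f"${estimate_cost(run, rates):.4f}")  # $0.0285
```

With these rates, the 50,000 cached tokens cost $0.015 instead of the $0.15 they would have cost as fresh input, a 10x reduction on the cached portion.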
6. Monitor Token Usage Proactively
Implement automatic token budget checking to warn before hitting limits:
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb
class TokenBudgetAgent:
    def __init__(self, max_tokens: int = 200000, warning_threshold: float = 0.95):
        self.max_tokens = max_tokens
        self.warning_threshold = warning_threshold
        self.agent = Agent(
            model=OpenAIChat(id="gpt-4o"),
            db=SqliteDb(db_file="agno.db"),
            add_history_to_context=True,
            num_history_runs=10,
        )

    def run_with_budget_check(self, message: str, session_id: str):
        """Run the agent with automatic token budget checking"""
        response = self.agent.run(message=message, session_id=session_id)

        # Get session metrics
        session_metrics = self.agent.get_session_metrics(session_id=session_id)
        if session_metrics:
            total_tokens = session_metrics.total_tokens or 0
            percentage_used = (total_tokens / self.max_tokens) * 100
            print(f"\n📊 Token Usage: {total_tokens:,} / {self.max_tokens:,} ({percentage_used:.1f}%)")

            # Warning at threshold
            if total_tokens > (self.max_tokens * self.warning_threshold):
                print("⚠️ WARNING: Approaching context limit! Consider:")
                print("  • Summarizing the conversation")
                print("  • Starting a new session")
                print("  • Reducing num_history_runs")

            # Critical threshold (99%)
            if total_tokens > (self.max_tokens * 0.99):
                print("🚨 CRITICAL: Context window nearly full!")
                print("  • Next message may fail")
                print("  • Start a new session immediately")

        return response
# Usage
budget_agent = TokenBudgetAgent(max_tokens=200000)
response = budget_agent.run_with_budget_check(
"Write a comprehensive analysis of machine learning",
session_id="user-456"
)
Conclusion
Context window management in Agno comes down to understanding the distinction between run metrics and session metrics, monitoring usage proactively, and configuring history, summaries, and caching appropriately.
Key takeaways:
- Always configure a database when you need session tracking and metrics
- Pass session_id to every agent.run() call to enable session continuity
- Monitor proactively using get_session_metrics() to avoid hitting limits
- Use session summaries to compress old history automatically
- Limit history with num_history_runs to control context size
- Enable prompt caching where available to reduce costs
- Implement budget limits to protect against unexpected expenses
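Pulling the pieces together, here is a sketch of a single agent configured with the practices above. Every parameter appears earlier in this guide; the specific values are illustrative starting points, not recommendations for every workload.

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb
from agno.session.summary import SessionSummaryManager

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db"),         # required for session metrics
    add_history_to_context=True,
    num_history_runs=3,                     # small, since summaries carry older context
    enable_session_summaries=True,
    add_session_summary_to_context=True,
    session_summary_manager=SessionSummaryManager(
        model=OpenAIChat(id="gpt-4o-mini")  # cheaper model for summaries
    ),
)

response = agent.run("Hello", session_id="user-123")
session_metrics = agent.get_session_metrics(session_id="user-123")
```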
By following these best practices, you'll build more reliable agents that provide better user experiences and predictable costs.


