Handling context window limits in Agno: Token tracking & preventing overflow

Yuvaraj Shanmugam
February 10, 2026
12 min read

Building production-ready AI agents often means discovering issues the hard way. If you've built agents, you've probably watched conversations grow longer and tool calls accumulate until, suddenly, your agent fails with a "prompt is too long" error. Or worse: your users keep chatting without realizing they're about to hit a wall.

The lesson: context windows have limits, and hitting them mid-conversation makes for a poor user experience.

This guide will show you how to track token usage in Agno and implement best practices to prevent context window overflow.

  • Why Token Tracking Matters
  • Understanding Agno's Token Metrics
  • Common Confusion Points
  • Best Practices for Managing Context Windows
  • Conclusion

Why Token Tracking Matters

Context window management is more than just a technical detail. It's critical for three reasons:

Cost Control: Every token costs money. Without tracking, your costs can spiral unpredictably, especially with long-running conversations or agents that make many tool calls.

User Experience: When a conversation exceeds the model's context window, the agent will fail with a cryptic error. Proactive monitoring helps you implement graceful fallback strategies - like automatically summarizing old messages or prompting users to start a new conversation, rather than letting agents crash unexpectedly. This is especially critical if users bring their own API keys where token limits directly affect their usage.

Reliability: Agents that crash mid-task due to exceeding token limits create a poor experience and can lose valuable context. Proactive management prevents these failures.
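As a taste of what "graceful fallback" can look like, here is a minimal sketch that catches an overflow error and degrades gracefully. The exact exception type and error message depend on your model provider, so the string match below is a heuristic assumption, not a real API:

```python
# Minimal fallback sketch. The exception type and message vary by
# provider, so matching on the error text is a heuristic, not an API.
def run_with_fallback(agent, message: str, session_id: str):
    try:
        return agent.run(message, session_id=session_id)
    except Exception as e:
        msg = str(e).lower()
        if "prompt is too long" in msg or "context" in msg:
            # Degrade gracefully instead of surfacing a cryptic error
            return "This conversation is too long. Please start a new session."
        raise  # unrelated errors should still propagate
```

In production you would likely trigger summarization or session rotation here instead of returning a canned message, but the shape is the same: detect the overflow, recover, and keep unrelated errors visible.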

Let's demystify Agno's token metrics so you can build with better insight.

Understanding Agno's Token Metrics

Agno provides token usage information through two primary mechanisms: run metrics and session metrics. Understanding the difference is crucial.

Run Metrics

Run metrics represent token usage for a single agent execution. Each run provides the following metrics:

  • input_tokens: Tokens in the current request (includes system prompt, user message, and conversation history if enabled)
  • output_tokens: Tokens generated in the response by the model
  • total_tokens: Sum of input and output tokens for this specific run
  • cache_write_tokens: Tokens written to cache (for models with prompt caching like Claude)
  • cache_read_tokens: Tokens read from cache on subsequent requests

Here's how to access run metrics:

from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    instructions="You are a helpful assistant."
)

response = agent.run("What is the capital of France?")

# Access run metrics
print(f"Input tokens: {response.metrics.input_tokens}")
print(f"Output tokens: {response.metrics.output_tokens}")
print(f"Total tokens: {response.metrics.total_tokens}")


Session Metrics

Session metrics accumulate token usage across an entire conversation. They provide:

  • total_tokens: Cumulative sum of all tokens used across all runs in the session
  • input_tokens: Total input tokens across all runs
  • output_tokens: Total output tokens across all runs
  • cache_creation_input_tokens: Total tokens written to cache across the session
  • cache_read_tokens: Total tokens read from cache across the session

To access session metrics, you need to enable a database and retrieve the session:

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb

# Create agent with database for session tracking
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db"),
    add_history_to_context=True,
    num_history_runs=10
)

# First interaction
response1 = agent.run("Tell me about Python", session_id="user-123")
print(f"Run 1 tokens: {response1.metrics.total_tokens}")

# Continued conversation
response2 = agent.run("What about its history?", session_id="user-123")
print(f"Run 2 tokens: {response2.metrics.total_tokens}")

# Get session metrics using the agent's method
session_metrics = agent.get_session_metrics(session_id="user-123")
if session_metrics:
    print(f"Session total tokens: {session_metrics.total_tokens}")

Common Confusion Points

Why do metrics seem inconsistent?

Several factors can cause this:

  1. Cache Tokens: Models with prompt caching (like Claude) report separate metrics for cached tokens. Your input_tokens may appear lower than expected because some tokens are being read from cache (cache_read_tokens) rather than counted as new input.
  2. History Inclusion: When add_history_to_context=True, input tokens include the previous conversation history.
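To build intuition for point 2, here is a rough back-of-the-envelope model, illustrative only: it uses a crude ~4-characters-per-token heuristic rather than a real tokenizer, but it shows why input_tokens grows run over run when history is included:

```python
def estimate_input_tokens(system_prompt: str, history: list, new_message: str,
                          chars_per_token: int = 4) -> int:
    """Crude estimate: with history enabled, every prior message is
    re-sent as input on each run, so input grows with the conversation."""
    text = system_prompt + "".join(history) + new_message
    return len(text) // chars_per_token

system = "You are a helpful assistant."
history = []
for turn in ["Tell me about Python", "What about its history?"]:
    print(estimate_input_tokens(system, history, turn))
    history.append(turn)
    history.append("...model reply...")
# prints 12, then 22 — each run re-pays for everything sent before it
```

This is why a ten-turn conversation can cost far more than ten isolated queries: the input side compounds, and it is exactly what num_history_runs and session summaries are designed to cap.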

Best Practices for Managing Context Windows

1. Set Appropriate History Limits

Don't load the entire conversation history every time. Use num_history_runs to control how many previous exchanges are included:

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db"),
    add_history_to_context=True,
    num_history_runs=5,  # Only include last 5 exchanges
)

2. Use Session Summaries for Long Conversations

Let Agno automatically summarise the conversation so far and add that to your context instead of granular history:

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb
from agno.session.summary import SessionSummaryManager

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db"),
    enable_session_summaries=True,
    add_session_summary_to_context=True,
    session_summary_manager=SessionSummaryManager(
        model=OpenAIChat(id="gpt-4o-mini")  # Use cheaper model for summaries
    ),
    num_history_runs=3  # Keep summaries + recent history
)

# The agent will automatically generate summaries to compress old context
response = agent.run("Follow up question...", session_id="user-789")

Note: When using session summaries, set a small num_history_runs value since the summary already captures the conversation context.

3. Monitor Token Usage for Multi-Agent Systems

For teams, track token usage at the team level:

from agno.team.team import Team
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb

db = SqliteDb(db_file="agno.db")

team = Team(
    members=[
        Agent(model=OpenAIChat(id="gpt-4o"), db=db),
        Agent(model=OpenAIChat(id="gpt-4o"), db=db),
    ],
    db=db,
)

# Pass session_id when running
response = team.run("Task", session_id="session-123")

# Get team-level session metrics
team_metrics = team.get_session_metrics(session_id="session-123")
if team_metrics:
    print(f"Team total tokens: {team_metrics.total_tokens}")


4. Implement Daily Budget Limits

Protect against runaway costs with custom budget tracking:

from datetime import datetime
from typing import Dict
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb

class DailyBudgetTracker:
    def __init__(self, daily_limit_tokens: int = 1000000):
        self.daily_limit = daily_limit_tokens
        self.usage: Dict[str, int] = {}
    
    def check_budget(self, session_id: str, estimated_tokens: int) -> bool:
        """Check if we're within daily budget"""
        today = datetime.now().date().isoformat()
        key = f"{session_id}:{today}"
        current_usage = self.usage.get(key, 0)
        
        if current_usage + estimated_tokens > self.daily_limit:
            raise Exception(
                f"Daily token limit exceeded: {current_usage:,} / {self.daily_limit:,}"
            )
        return True
    
    def record_usage(self, session_id: str, tokens: int):
        """Record token usage"""
        today = datetime.now().date().isoformat()
        key = f"{session_id}:{today}"
        self.usage[key] = self.usage.get(key, 0) + tokens

# Usage Example
budget_tracker = DailyBudgetTracker(daily_limit_tokens=500000)
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    db=SqliteDb(db_file="agno.db")
)

try:
    budget_tracker.check_budget("user-123", estimated_tokens=10000)
    response = agent.run("message", session_id="user-123")
    
    if response.metrics:
        budget_tracker.record_usage("user-123", response.metrics.total_tokens)
    else:
        print("Warning: No metrics available for budget tracking")
except Exception as e:
    print(f"Budget limit reached: {e}")
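The in-memory tracker above resets whenever the process restarts. A sketch of a persistent variant using stdlib sqlite3 follows; the table name token_usage and the class name are my own choices, not Agno APIs:

```python
import sqlite3
from datetime import datetime

class PersistentBudgetTracker:
    """Like DailyBudgetTracker, but survives process restarts by
    keeping per-session, per-day counts in SQLite."""

    def __init__(self, db_path: str = "budget.db", daily_limit_tokens: int = 1_000_000):
        self.daily_limit = daily_limit_tokens
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS token_usage "
            "(session_id TEXT, day TEXT, tokens INTEGER, "
            "PRIMARY KEY (session_id, day))"
        )

    def usage_today(self, session_id: str) -> int:
        day = datetime.now().date().isoformat()
        row = self.conn.execute(
            "SELECT tokens FROM token_usage WHERE session_id = ? AND day = ?",
            (session_id, day),
        ).fetchone()
        return row[0] if row else 0

    def check_budget(self, session_id: str, estimated_tokens: int) -> bool:
        used = self.usage_today(session_id)
        if used + estimated_tokens > self.daily_limit:
            raise RuntimeError(
                f"Daily token limit exceeded: {used:,} / {self.daily_limit:,}"
            )
        return True

    def record_usage(self, session_id: str, tokens: int):
        day = datetime.now().date().isoformat()
        # Upsert: insert the first count for the day, or add to it
        self.conn.execute(
            "INSERT INTO token_usage (session_id, day, tokens) VALUES (?, ?, ?) "
            "ON CONFLICT(session_id, day) DO UPDATE SET tokens = tokens + ?",
            (session_id, day, tokens, tokens),
        )
        self.conn.commit()
```

The composite primary key (session_id, day) gives you the same per-session daily buckets as the dictionary version, and the ON CONFLICT upsert keeps record_usage to a single statement.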


5. Enable Prompt Caching for Cost Savings

For models that support it, enable prompt caching to reduce costs:

from agno.agent import Agent
from agno.models.anthropic import Claude

agent = Agent(
    model=Claude(
        id="claude-sonnet-4-20250514",
        cache_system_prompt=True  # Cache the system prompt
    ),
    instructions="Your large system prompt here...",
)

# First run creates cache
response1 = agent.run("First query")
if response1.metrics:
    print(f"Cache write tokens: {response1.metrics.cache_creation_input_tokens}")

# Second run uses cache
response2 = agent.run("Second query")
if response2.metrics:
    print(f"Cache read tokens: {response2.metrics.cache_read_tokens}")


6. Monitor Token Usage Proactively

Implement automatic token budget checking to warn before hitting limits:

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.db.sqlite import SqliteDb

class TokenBudgetAgent:
    def __init__(self, max_tokens: int = 200000, warning_threshold: float = 0.95):
        self.max_tokens = max_tokens
        self.warning_threshold = warning_threshold
        
        self.agent = Agent(
            model=OpenAIChat(id="gpt-4o"),
            db=SqliteDb(db_file="agno.db"),
            add_history_to_context=True,
            num_history_runs=10
        )
    
    def run_with_budget_check(self, message: str, session_id: str):
        """Run agent with automatic token budget checking"""
        response = self.agent.run(message, session_id=session_id)
        
        # Get session metrics
        session_metrics = self.agent.get_session_metrics(session_id=session_id)
        
        if session_metrics:
            total_tokens = session_metrics.total_tokens or 0
            percentage_used = (total_tokens / self.max_tokens) * 100
            
            print(f"\n📊 Token Usage: {total_tokens:,} / {self.max_tokens:,} ({percentage_used:.1f}%)")
            
            # Warning at threshold
            if total_tokens > (self.max_tokens * self.warning_threshold):
                print(f"⚠️  WARNING: Approaching context limit! Consider:")
                print(f"   • Summarizing the conversation")
                print(f"   • Starting a new session")
                print(f"   • Reducing num_history_runs")
            
            # Critical threshold (99%)
            if total_tokens > (self.max_tokens * 0.99):
                print(f"🚨 CRITICAL: Context window nearly full!")
                print(f"   • Next message may fail")
                print(f"   • Start a new session immediately")
        
        return response

# Usage
budget_agent = TokenBudgetAgent(max_tokens=200000)
response = budget_agent.run_with_budget_check(
    "Write a comprehensive analysis of machine learning",
    session_id="user-456"
)


Conclusion

Context window management in Agno comes down to understanding the distinction between run metrics and session metrics, monitoring usage proactively, and applying safeguards like history limits, session summaries, and budget checks.

Key takeaways:

  • Always configure a database when you need session tracking and metrics
  • Pass session_id to every agent.run() call to enable session continuity
  • Monitor proactively using get_session_metrics() to avoid hitting limits
  • Use session summaries to compress old history automatically
  • Limit history with num_history_runs to control context size
  • Enable prompt caching where available to reduce costs
  • Implement budget limits to protect against unexpected expenses

By following these best practices, you'll build more reliable agents that provide better user experiences and predictable costs.