The 5 levels of agentic software: A progressive framework for building reliable AI agents

Cosette Cressler
March 24, 2026
9 min read

The agent is running. The demo went great. The investor call is booked.

Then the real questions start. Why does it forget everything between sessions? Why is it ignoring the style guide your team spent three weeks refining? Why is it repeating research that was already completed, sometimes twice in the same session, driving up token costs that now need to be explained to finance?

The instinct is to blame the model. So you swap it out. Then you rewrite the prompts. Then you tweak the embeddings. And still, the agent behaves as if it is encountering the codebase—and the user—for the very first time, every single time.

Here's what's actually happening: complexity is being added before the foundation is solid.

Many teams jump straight into multi-agent orchestration because it looks impressive in a demo. It feels like progress. What they don't realize until weeks later is that a single agent with good instructions and proper infrastructure would have solved 80% of their problem without the coordination failures, context confusion, or spiraling costs.

An agent's architecture is not just about what it can do. It is about what it knows, what it remembers, and how it evolves over time. Those capabilities do not emerge automatically. They have to be built in, layer by layer, with intention.

This is the idea behind the Five Levels of Agentic Software: a progressive framework for building agents that do not just perform well in demos, but hold up under real-world conditions.

What are the 5 Levels of agentic software and why do they matter?

The Five Levels of Agentic Software is an architectural framework developed by Ashpreet Bedi, founder and CEO of Agno. It outlines how agent capability and complexity should evolve step by step, starting with a stateless LLM equipped with tools and progressing toward a fully autonomous, self-learning, multi-agent system.

The core premise is simple but often ignored: every level adds complexity, and complexity comes with real cost. That cost only makes sense to pay when the simpler approach has clearly hit its ceiling.

What makes the framework valuable is how it shifts the question. Instead of asking what is possible to build, it pushes teams to ask what is actually necessary. In a landscape where multi-agent orchestration, autonomous reasoning loops, and heavy infrastructure are often treated as the default starting point, this perspective offers a useful counterbalance.

By grounding decisions in necessity rather than novelty, the framework helps teams avoid premature complexity. The result is less wasted engineering time, fewer debugging cycles, and agents that work reliably at each stage before additional layers are introduced.

Level 1: What is a stateless AI agent and when is it enough?

At its most basic, an agent is just an LLM with tools. Without tools, it can reason, but it can't act. For a coding agent, the minimum viable toolset is read files, write files, and run shell commands. That's it. That's Level 1.

from pathlib import Path

from agno.agent import Agent
from agno.models.openai import OpenAIResponses
from agno.tools.coding import CodingTools

WORKSPACE = Path(__file__).parent.joinpath("workspace")
WORKSPACE.mkdir(parents=True, exist_ok=True)

agent = Agent(
    name="Gcode",
    model=OpenAIResponses(id="gpt-5.2"),
    instructions=(
        "You are a coding agent. Write clean, well-documented code. "
        "Always save your work to files and test by running them."
    ),
    tools=[CodingTools(base_dir=WORKSPACE, all=True)],
    markdown=True,
)

agent.print_response(
    "Write a Fibonacci function, save it to fib.py, and run it to verify",
    stream=True,
)

What's happening under the hood: the agent receives a task, uses its tool set to write, edit, and execute code, and returns a result. It works. It can solve real problems.

When should you use stateless agents?

Level 1 agents are sufficient for isolated, self-contained tasks. And that covers more ground than most teams think.

What's missing is just as important: every run starts from zero. The agent has no memory of previous sessions, no access to team conventions unless they're pasted into the prompt, and no knowledge beyond what's in the current context window. Level 1 is stateless by design. Everything must be in the context.
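To make the cost of statelessness concrete, here is a plain-Python sketch (no framework involved; `CONVENTIONS` and `build_prompt` are made-up names for illustration): every run has to re-ship the same conventions inside the prompt, because nothing carries over between runs.

```python
# Level 1 is stateless: anything the agent must know has to be packed
# into the prompt on every single run. Illustrative names only.

CONVENTIONS = """\
- Use type hints on all function signatures
- Write docstrings in Google style
"""

def build_prompt(task: str) -> str:
    """Re-assemble the full context from scratch for each run."""
    return f"Team conventions:\n{CONVENTIONS}\nTask: {task}"

# Every call repeats the conventions; nothing persists between them.
prompt_a = build_prompt("Write a Fibonacci function")
prompt_b = build_prompt("Write a log parser")
```

Multiply that repeated boilerplate across every run and every user, and the motivation for Level 2 becomes a token bill.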

Level 2: How do knowledge bases and session storage make AI agents more reliable?

Level 1 is stateless, which means everything has to be provided in the prompt each time. Level 2 addresses this limitation by introducing two foundational capabilities: session storage and domain knowledge.

What does session storage give an AI agent?

Storage gives the agent continuity. Each session is saved to a database, which immediately unlocks two important benefits. First, the agent can access recent chat history and include the last set of interactions in its context window, allowing it to stay grounded in the current conversation. Second, it creates a reliable record of what happened during each session. This makes it possible to trace the agent's actions, decisions, and outputs without relying on external logging or sending sensitive data to third-party providers.
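The mechanics are simple enough to sketch in plain Python. This toy version (raw sqlite, not Agno's SqliteDb, and the function names are illustrative) appends every message to a table and reloads the last N turns at the start of the next run:

```python
import sqlite3

# Toy session storage: append every message, reload the last n turns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (session_id TEXT, role TEXT, content TEXT)"
)

def save_message(session_id: str, role: str, content: str) -> None:
    conn.execute("INSERT INTO messages VALUES (?, ?, ?)",
                 (session_id, role, content))

def recent_history(session_id: str, n: int = 3) -> list[tuple[str, str]]:
    """Return the last n (role, content) pairs, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? "
        "ORDER BY rowid DESC LIMIT ?", (session_id, n)
    ).fetchall()
    return list(reversed(rows))

save_message("s1", "user", "Use tabs, not spaces")
save_message("s1", "assistant", "Noted: tabs it is")
save_message("s1", "user", "Now refactor utils.py")
history = recent_history("s1", n=2)
```

The same table doubles as the audit trail: every action the agent took is queryable after the fact.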

What does a knowledge base give an AI agent?

Knowledge expands what the agent can understand beyond the codebase itself. Most coding agents today only see the files in front of them. They have no visibility into architecture documents, design decisions, internal runbooks, or even the Slack conversations where key choices were explained. As a result, they often make decisions that conflict with months of prior thinking simply because that context is missing.

A knowledge base fills this gap. It provides a structured, searchable store of information that matters but does not need to live in the context window at all times. This can include specifications, architecture decision records, meeting notes, runbooks, and team discussions. Using a combination of semantic and keyword search, the agent can retrieve the most relevant pieces of information at runtime and incorporate them into its context. This is what agentic retrieval looks like in practice, where the agent pulls in the right knowledge only when it is needed.
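The "combination of semantic and keyword search" can be sketched in a few lines. This is illustrative only, not ChromaDb's internals: blend a keyword-overlap score with a vector-similarity score, weighted by a mixing parameter, and return the best-scoring document. The toy two-dimensional embeddings stand in for real ones.

```python
import math

def keyword_score(query: str, doc: str) -> float:
    # Fraction of query terms that appear in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    # alpha balances exact keyword matching against semantic similarity.
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, d_vec)

docs = {
    "adr-001": ("We chose Postgres for transactional writes", [0.9, 0.1]),
    "runbook": ("Restart the worker pool after deploys", [0.1, 0.9]),
}
query, q_vec = "why did we choose postgres", [0.8, 0.2]
best = max(docs, key=lambda k: hybrid_score(query, docs[k][0], q_vec, docs[k][1]))
```

Keyword matching catches exact identifiers ("Postgres", an ADR number); the vector side catches paraphrases. Hybrid search gets both.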

from agno.db.sqlite import SqliteDb
from agno.knowledge import Knowledge
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.vectordb.chroma import ChromaDb, SearchType

db = SqliteDb(db_file=str(WORKSPACE / "agents.db"))

knowledge = Knowledge(
    vector_db=ChromaDb(
        collection="coding-standards",
        path=str(WORKSPACE / "chromadb"),
        search_type=SearchType.hybrid,
        embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    ),
)

agent = Agent(
    name="Gcode",
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[CodingTools(base_dir=WORKSPACE, all=True)],
    knowledge=knowledge,
    search_knowledge=True,
    db=db,
    add_history_to_context=True,
    num_history_runs=3,
    markdown=True,
)

How do you load content into an AI agent's knowledge base?

Seeding the knowledge base is straightforward—text, PDFs, and URLs can all be inserted directly. At runtime, the agent searches for relevant chunks and pulls them into context automatically:

# Load your coding standards
knowledge.insert(text_content="""
## Project Conventions
- Use type hints on all function signatures
- Write docstrings in Google style
- Prefer list comprehensions over map/filter
""")

What changed

Two additions: Knowledge backed by ChromaDb (with hybrid search for semantic and keyword matching), and SqliteDb for session storage. The result: the agent now searches knowledge before coding, follows standards it wasn't trained on, and maintains continuity across multi-turn conversations. This is basic Agentic RAG in practice.

When should you add storage and a knowledge base to your AI agent?

When the agent needs to follow standards that aren't in its training data, or when users expect conversations that pick up where they left off. For most internal tools, this is the sweet spot.

Level 3: How do AI agents learn and improve without fine-tuning?

The jump from Level 2 to Level 3 is the most significant one in the framework. At Level 2, the agent follows rules that are given to it. At Level 3, it learns rules from experience. That distinction is the difference between a system that performs consistently and a system that improves.

from agno.learn import LearnedKnowledgeConfig, LearningMachine, LearningMode
from agno.tools.reasoning import ReasoningTools

learned_knowledge = Knowledge(
    vector_db=ChromaDb(
        collection="coding-learnings",
        path=str(WORKSPACE / "chromadb"),
        search_type=SearchType.hybrid,
        embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    ),
)

agent = Agent(
    name="Gcode",
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[
        CodingTools(base_dir=WORKSPACE, all=True),
        ReasoningTools(),
    ],
    knowledge=knowledge,  # the documentation knowledge base from Level 2
    search_knowledge=True,
    learning=LearningMachine(
        knowledge=learned_knowledge,
        learned_knowledge=LearnedKnowledgeConfig(
            mode=LearningMode.AGENTIC,
        ),
    ),
    enable_agentic_memory=True,
    db=db,
    markdown=True,
)

Agentic memory at this level means the agent can extract facts from conversations, identify patterns across sessions, and synthesize learnings into a persistent memory store that shapes future behavior. Ask it about a user's preferred code style once, and it remembers. Correct a mistake, and it doesn't repeat it. Establish a team convention through usage, and the agent begins to reflect that convention organically.
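The save/search loop behind this can be sketched in plain Python. This is not Agno's LearningMachine, and the function names mirror the tool names only for illustration: the agent stores distilled learnings as text, then retrieves the relevant ones before acting on a new task.

```python
# Toy learning store: the agent saves distilled notes in one session
# and surfaces them by keyword overlap in later sessions.
learnings: list[str] = []

def save_learning(note: str) -> None:
    """Persist a distilled observation for future sessions."""
    learnings.append(note)

def search_learnings(query: str) -> list[str]:
    """Naive keyword retrieval over stored learnings."""
    terms = set(query.lower().split())
    return [n for n in learnings if terms & set(n.lower().split())]

# Session 1: the agent decides a user preference is worth keeping.
save_learning("user prefers functional style: pure functions, no classes")

# Session 2: a different task surfaces the stored preference.
relevant = search_learnings("write a functional log parser")
```

A production version swaps the list for a vector store and lets the LLM decide what is worth saving, but the shape of the loop is the same.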

This is what we call "GPU Poor Continuous Learning" — continuous improvement without fine-tuning, retraining, or any of the infrastructure traditionally required for model updates. The model doesn't get smarter. The system does. And as the underlying models improve, the system benefits automatically.

The two-session test illustrates this clearly.

# Session 1: User teaches a preference
agent.print_response(
    "I prefer functional programming style — no classes, "
    "use pure functions and immutable data. Write a data pipeline.",
    session_id="session_1",
)

# Session 2: New task — agent should apply the preference
agent.print_response(
    "Write a log parser that extracts error counts by category.",
    session_id="session_2",
)

In Session 1, a user specifies a preference for functional programming style—no classes, pure functions, immutable data. In Session 2, given a completely different task, the agent searches its learnings, finds that preference, and writes functional code automatically.

What changed

  • LearningMachine gives the agent save_learning and search_learnings tools. It decides what's worth remembering: coding patterns that worked, mistakes to avoid, user preferences — stored in a separate knowledge base and surfaced in future sessions.
  • ReasoningTools is added, giving the agent the ability to reflect before acting.
  • enable_agentic_memory=True builds a user profile over time: preferred coding style, frameworks in use, how the user likes explanations delivered.

When does your AI agent need to learn from experience?

When the agent serves the same users repeatedly and should improve over time. Personal coding assistants, team tools with shared learnings, any context where "do it the way we like it" matters.

Level 4: When should you use a multi-agent architecture?

Most teams skip straight to this level. That's the mistake the framework is specifically designed to prevent.

Sometimes one agent isn't enough. Level 4 introduces multi-agent teams—splitting responsibilities across specialized agents coordinated by a team leader.

from agno.team.team import Team

coder = Agent(
    name="Coder",
    role="Write code based on requirements",
    tools=[CodingTools(base_dir=WORKSPACE, all=True)],
)

reviewer = Agent(
    name="Reviewer",
    role="Review code for quality, bugs, and best practices",
    tools=[CodingTools(base_dir=WORKSPACE,
                       enable_write_file=False,
                       enable_edit_file=False,
                       enable_run_shell=False)],
)

tester = Agent(
    name="Tester",
    role="Write and run tests for the code",
    tools=[CodingTools(base_dir=WORKSPACE, all=True)],
)

coding_team = Team(
    name="Coding Team",
    members=[coder, reviewer, tester],
    show_members_responses=True,
    markdown=True,
)

The Coder agent writes. The Reviewer reads. The Tester validates. Each is optimized for its domain. Together, they handle tasks that no single agent could manage effectively.

A note of caution: multi-agent teams are powerful but unpredictable. The team leader is an LLM making delegation decisions — sometimes it delegates well, sometimes it doesn't. For production systems where reliability matters, prefer explicit workflows over dynamic teams. Teams shine in human-supervised settings where someone can review the output before it ships.
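The "explicit workflow" alternative is worth seeing side by side. In this sketch the hand-offs are fixed in code rather than chosen by an LLM team leader; the agents are stand-in callables, not Agno Agent objects, and the pipeline logic is illustrative.

```python
def run_pipeline(task, coder, reviewer, tester):
    """Deterministic coder -> reviewer -> tester hand-off."""
    code = coder(task)
    review = reviewer(code)
    if "reject" in review:
        # One bounded revision pass, not an open-ended loop.
        code = coder(task + f" (address review: {review})")
    report = tester(code)
    return code, review, report

# Stub agents standing in for LLM-backed ones:
result = run_pipeline(
    "fibonacci",
    coder=lambda t: f"def fib(n): ...  # for task: {t}",
    reviewer=lambda c: "approve",
    tester=lambda c: "all tests passed",
)
```

Every path through this function can be enumerated and tested. A dynamic team trades that guarantee for flexibility, which is exactly why it belongs in supervised settings.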

What changed

  • One agent becomes multiple, each with a defined role, a tailored toolset, and a clean separation of concerns.
  • A Team object coordinates them, routing tasks and aggregating responses.
  • With show_members_responses=True, every agent's output is visible, making the system easier to supervise and debug.

When does a multi-agent system make sense?

When you need multiple perspectives (code review is a perfect example), when tasks naturally decompose into specialist roles, or when you're building interactive tools where a human can supervise the team.

Level 5: How do you deploy an AI agent system to production?

Level 5 is the runtime that turns Levels 1–4 into a production service.

It's the full realization of agentic software: an autonomous, self-organizing system that can plan, adapt, recover from failure, and improve over time without requiring human intervention at every step.

Development databases get swapped for production ones, tracing gets added, and everything is exposed as an API.

from agno.db.postgres import PostgresDb
from agno.vectordb.pgvector import PgVector, SearchType
from agno.os import AgentOS

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"
db = PostgresDb(db_url=db_url)

knowledge = Knowledge(
    vector_db=PgVector(
        db_url=db_url,
        table_name="coding_knowledge",
        search_type=SearchType.hybrid,
        embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    ),
)

agent_os = AgentOS(
    id="Gcode OS",
    agents=[agent],
    teams=[coding_team],
    config=config_path,  # path to an optional AgentOS config file
    tracing=True,
)
app = agent_os.get_app()

if __name__ == "__main__":
    agent_os.serve(app="run:app", reload=True)

What changed

  • PostgreSQL + PgVector replaces SQLite + ChromaDb. Real connection pooling, real backups, real concurrent access.
  • AgentOS wraps your agents in a FastAPI application with a built-in web UI, session management, and tracing.
  • Tracing (tracing=True) gives you observability into every tool call, every knowledge search, every delegation decision.
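To see what tracing buys you, here is a minimal stdlib sketch of the idea (illustrative only; `tracing=True` handles this inside AgentOS): wrap each tool call so its name, arguments, and latency are recorded in a trace you can inspect after the fact.

```python
import functools
import time

TRACE: list[dict] = []

def traced(fn):
    """Record name, args, and latency for every call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "tool": fn.__name__,
            "args": args,
            "ms": round((time.perf_counter() - start) * 1000, 2),
        })
        return result
    return wrapper

@traced
def read_file(path: str) -> str:
    # Stand-in for a real tool implementation.
    return f"<contents of {path}>"

read_file("fib.py")
```

When an agent misbehaves in production, this trace is the difference between "it did something weird" and "the third tool call searched the wrong collection."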

When is your AI agent ready for production?

Level 5 is the right choice when your agent is no longer just yours. Multiple users, uptime requirements, and the need to debug production issues all point here.

How to choose the right level of agentic AI for your use case

Always start at Level 1. An agent without context or memory can still solve real problems. Prove that it does before adding anything.

Level 2 is the sweet spot for most internal tools. Session storage and a knowledge base solve the most common production complaints—context loss and convention blindness—without the overhead of autonomous learning or multi-agent coordination.

The jump from Level 2 to Level 3 is the most consequential. It's where the agent stops following instructions and starts generating them from experience. That shift requires infrastructure, not just code.

Multi-agent architecture (Level 4) is a solution to a specific class of problem, not a default approach. The coordination overhead is real and the debugging is harder. Reserve it for tasks that genuinely require it.

Level 5 is distributed systems engineering with probabilistic reasoning in the execution path. The lessons of reliable distributed systems—durability, isolation, governance, persistence, scale—apply directly. The AI industry is still learning this.

Here's the cookbook with runnable code for all five levels. Clone it, run it, and share what you build.