Agent Building for Software Engineers: It’s Mostly a While Loop


An AI agent is a while loop wrapped around a model call. That’s it. That’s the secret.

Everything else—tools, memory, planning, “reasoning”—is implementation detail layered on top of that loop. Once you see it, the mystique evaporates and you’re left with something far more useful: a system you can actually reason about, debug, and build.

This post is for engineers who’ve used coding agents like Claude Code and now want to build their own. Not the framework-tutorial version. The mental-model version—what an agent is as a piece of software, where the hard parts actually live, and why your normal engineering instincts will both help and betray you.

Here’s a complete agent. The model client is a stand-in for whichever API you use, but the control flow is not pseudocode. It is the shape nearly every agent shares:

def agent(task, tools, model):
    # `tools` maps tool names to functions; `model` is whichever client you use
    messages = [{"role": "user", "content": task}]

    while True:
        response = model.call(messages, tools=tools)
        messages.append(response)

        if response.tool_calls:
            for call in response.tool_calls:
                result = tools[call.name](**call.args)   # run the tool the model asked for
                messages.append({"role": "tool", "content": str(result)})   # feed the result back
        else:
            return response.content   # model is done, no more tools

That’s the engine. A loop that calls a model, checks whether the model wants to use a tool, runs the tool if so, feeds the result back, and repeats until the model stops asking for tools.
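To watch the loop run without an API key, here is a minimal sketch that swaps the model for a scripted stand-in. `ScriptedModel`, `Reply`, `ToolCall`, `run_agent`, and the `add` tool are hypothetical names invented for this illustration; a real client returns richer objects and its message format will differ.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Reply:
    content: str = ""
    tool_calls: list = field(default_factory=list)

class ScriptedModel:
    """Stand-in for a real model: replays a fixed list of replies."""
    def __init__(self, replies):
        self._replies = iter(replies)

    def call(self, messages, tools):
        return next(self._replies)

def run_agent(task, tools, model):
    messages = [{"role": "user", "content": task}]
    while True:
        reply = model.call(messages, tools=tools)
        messages.append({"role": "assistant", "content": reply.content})
        if not reply.tool_calls:
            return reply.content, messages   # no tool requested: the model is done
        for call in reply.tool_calls:
            result = tools[call.name](**call.args)
            messages.append({"role": "tool", "name": call.name, "content": str(result)})

def add(a, b):
    return a + b

model = ScriptedModel([
    Reply(tool_calls=[ToolCall("add", {"a": 2, "b": 3})]),  # turn 1: ask for a tool
    Reply(content="The sum is 5."),                          # turn 2: final answer
])
answer, transcript = run_agent("What is 2 + 3?", {"add": add}, model)
print(answer)
```

The transcript ends up with four entries: the task, the tool request, the tool result, and the final answer. That list is the whole state of the agent.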

Strip it to a sentence: the model decides what to do, you execute it, you tell it what happened, and you let it decide again.

If you’ve written a REPL, an event loop, or a state machine, this is familiar territory. The novelty isn’t the control flow. It’s that the branch predictor is a language model, and it’s non-deterministic. Hold that thought—it’s the source of every hard problem later.

Almost everything you’ve built talks to a service the same way: request in, response out. You call an endpoint, you get an answer, you’re done. Stateless, one round-trip, and you wrote every branch that decided what happened next.

Request/response (everything you know):

  request ──▶ [ service ] ──▶ response      (done)

  One call. Stateless. You own the control flow.

Agentic loop (the new shape):

  task ──▶ [ model ] ──▶ tool call ──▶ [ you run it ] ──┐
             ▲                                          │
             └──────────────── result ─────────────────┘
             (repeat until the model says "done")

  Many calls. Stateful. The model owns the control flow.

The difference that matters isn’t the number of round-trips. It’s who decides what happens next. In request/response, you did—you wrote the if statements. In an agentic loop, the model does.

An analogy. Conventional programming is telling a system: “if the room is dark, turn on the light.” You inspected the condition, you chose the action, you wrote the branch. Deterministic and total—every path is one you anticipated.

An agent is different. You hand the model a goal—“make the room brighter”—and a set of tools: turn on the light, lift the blinds, open the curtains. Then you let it decide which to use, in what order, and when it’s bright enough to stop. You didn’t write the branch. You delegated the decision.

That delegation is the entire point. It’s why an agent can handle tasks you couldn’t enumerate in advance—and it’s also the source of every hard problem in this post. The moment the model owns the control flow, you trade the guarantees of determinism for the flexibility of judgment. Everything that follows is managing that trade.

The word “tool” makes it sound special. It isn’t. A tool is a function you expose to the model, described well enough that the model knows when and how to call it.

{
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {
        "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"}
    }
}

The model never runs your code. It emits a structured request (“I’d like to call get_weather with city='Tokyo'”) and you execute that in your runtime and hand back the result. The model is the planner; your code is the hands.
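Since the model's request is just data, executing it is a lookup and a function call. The sketch below assumes the request arrives as JSON with `name` and `arguments` fields, which is a hypothetical wire format chosen for illustration; real APIs each have their own shape. The `get_weather` body is a stub.

```python
import json

def get_weather(city: str) -> dict:
    # Hypothetical stub: a real tool would call a weather service here.
    return {"city": city, "temp_c": 21, "conditions": "clear"}

TOOLS = {"get_weather": get_weather}

def execute_tool_request(raw_request: str) -> str:
    """Parse the model's structured request, run the matching function, return JSON text."""
    request = json.loads(raw_request)
    func = TOOLS[request["name"]]
    result = func(**request["arguments"])
    return json.dumps(result)

model_output = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
print(execute_tool_request(model_output))
```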

This means tool building is API design, a discipline you already have. The same rules apply, just with a non-human consumer:

Good tool design:
  - One tool, one job (get_weather, not do_everything)
  - Descriptive names and parameter docs (the model reads these)
  - Return structured, parseable results
  - Fail loudly with useful error messages

Bad tool design:
  - Overloaded tools with 15 optional parameters
  - Vague descriptions ("processes data")
  - Returning raw stack traces or 10,000-line blobs
  - Silent failures the model can't detect

The model reads your descriptions the way a new engineer reads your API docs—except it has no Slack to ask follow-ups. The quality of your descriptions is the quality of your tool calls. Ambiguity in the docstring shows up as the model calling the wrong tool with the wrong arguments.
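One way to keep descriptions honest is to derive them from the function itself, so the docstring the model reads is the one you maintain. This is a rough sketch using the standard `inspect` module; `describe_tool` and the output shape are assumptions for illustration, not any particular SDK's schema format.

```python
import inspect

def get_weather(city: str, units: str = "celsius") -> dict:
    """Get current weather for a city."""
    return {"city": city, "units": units}

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def describe_tool(func) -> dict:
    """Build a tool description from a function's name, docstring, and type hints."""
    params = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unusual types fall back to "string" in this sketch.
        params[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": params,
        "required": required,
    }

schema = describe_tool(get_weather)
print(schema)
```

If the docstring is vague, the generated description is vague too, which makes the problem visible in code review rather than in a bad tool call.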

One rule that matters more than the rest: error messages are prompts. When a tool fails, the model reads the error and decides what to do next. “Error: 500” tells it nothing. “Error: city ‘Tokoy’ not found, did you mean ‘Tokyo’?” lets it self-correct on the next loop iteration. Your error strings are part of the agent’s reasoning surface.
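Here is a small sketch of an error message written for the model rather than for a log file. It uses `difflib.get_close_matches` from the standard library to suggest a correction; the city list and the canned weather string are made up for the example.

```python
import difflib

KNOWN_CITIES = ["Tokyo", "Toronto", "Turin"]

def get_weather(city: str) -> str:
    if city not in KNOWN_CITIES:
        # Offer the closest known name so the model can retry with a fix.
        guesses = difflib.get_close_matches(city, KNOWN_CITIES, n=1)
        hint = f" Did you mean '{guesses[0]}'?" if guesses else ""
        return f"Error: city '{city}' not found.{hint}"
    return f"Weather in {city}: 21 C, clear"

print(get_weather("Tokoy"))
```

The error is returned as a normal tool result instead of raised, so it lands in the context where the model can act on it.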

Here’s where engineers get blindsided. You’re used to systems with effectively unlimited memory and perfect recall. An agent has neither.

Everything the agent “knows” in the moment lives in the context window—the running list of messages you pass back on every model call. The task, the conversation, every tool result, every intermediate thought. All of it competes for a fixed token budget.

Context window = working memory

  [system prompt        ]  ← who the agent is, always present
  [task                 ]  ← what you asked
  [tool call + result   ]  ← grows every iteration
  [tool call + result   ]
  [tool call + result   ]  ← ...and grows
  [tool call + result   ]
  ───────────────────────
  Eventually: full. Now what?

Every loop iteration adds to this. A long-running agent—say, one debugging across dozens of files—will blow through its budget. When it fills, you have three options, and managing them is the actual job of agent engineering:

Truncation. Drop the oldest messages. Cheap, simple, and the agent forgets what it was doing. Works for stateless tasks, fails for anything requiring continuity.

Compaction. Summarize old turns into a compressed note and replace the raw history. This is what serious agents do. (If you’ve seen a coding agent say “compacting conversation,” this is it.) You trade fidelity for space—the summary keeps the gist and loses the detail.

Retrieval. Keep the bulk of state outside the context in a store, and pull in only what’s relevant for the current step. The agent’s memory becomes a database query, not a scroll-back.
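The first two options can be sketched in a few lines. Everything here is an assumption made for illustration: the four-characters-per-token estimate is crude (use a real tokenizer in practice), and `summarize` stands in for a model call that writes the summary.

```python
def estimate_tokens(message: dict) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(message["content"]) // 4)

def total_tokens(messages: list) -> int:
    return sum(estimate_tokens(m) for m in messages)

def truncate(messages: list, budget: int) -> list:
    """Drop the oldest non-system messages until the list fits the budget."""
    kept = list(messages)
    while total_tokens(kept) > budget and len(kept) > 2:
        del kept[1]  # index 0 is the system prompt; always keep it
    return kept

def compact(messages: list, budget: int, summarize, keep_recent: int = 2) -> list:
    """Replace older turns with one summary note; keep the newest turns verbatim."""
    if total_tokens(messages) <= budget or len(messages) <= keep_recent + 1:
        return list(messages)
    system, old, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
    note = {"role": "user", "content": "Summary of earlier work: " + summarize(old)}
    return [system, note, *recent]

history = [{"role": "system", "content": "You are a debugging agent."}]
history += [{"role": "tool", "content": f"step {i}: " + "x" * 200} for i in range(10)]

def fake_summarize(old):  # stand-in for a model call that writes the summary
    return f"{len(old)} earlier tool results, no errors found."

short = truncate(history, budget=150)
compacted = compact(history, budget=150, summarize=fake_summarize)
print(len(history), len(short), len(compacted))
```

Truncation keeps the system prompt and the two newest results and forgets the rest. Compaction keeps the same recent turns but leaves a one-line note about the eight it dropped.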

This discipline—deciding what enters the context, what gets summarized, and what gets fetched on demand—is context engineering, and it’s where most of your design effort will actually go. Prompt wording is the part everyone talks about; context management is the part that determines whether the agent works.

The counterintuitive bit: more context is not better. A bloated context window degrades model performance—relevant details get buried under transcript noise, and the model starts attending to the wrong things. People call it “context rot.” Curating down often beats stuffing in. Treat tokens like a memory budget in an embedded system: every byte you spend on history is a byte you can’t spend on thinking.

Once you accept the context window as RAM, the next question answers itself: you need disk too.

Short-term memory  =  the context window
                      (this session, this task, ephemeral)

Long-term memory   =  external store + retrieval
                      (across sessions, persistent, queried)

Short-term memory is just the message list—it dies when the session ends. For anything that needs to persist (user preferences, facts learned last week, accumulated project knowledge), you write to an external store and retrieve relevant pieces back into context when needed.

The retrieval can be as simple as a key-value lookup or as involved as semantic search over embeddings. Start simple. A surprising number of “memory” requirements are satisfied by a plain database keyed on user ID, not a vector store. Reach for embeddings when you actually need fuzzy recall over unstructured history—not before. The vector database is a tool, not a rite of passage.
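As a sketch of the "plain database keyed on user ID" approach, here is long-term memory built on the standard `sqlite3` module. `MemoryStore`, `build_context`, the table layout, and the example facts are all hypothetical; the point is only that recall is an ordinary query whose results get written into the context.

```python
import sqlite3

class MemoryStore:
    """Long-term memory as a plain key-value table keyed on user ID."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory ("
            "user_id TEXT, key TEXT, value TEXT, PRIMARY KEY (user_id, key))"
        )

    def remember(self, user_id, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO memory (user_id, key, value) VALUES (?, ?, ?)",
            (user_id, key, value),
        )
        self.db.commit()

    def recall(self, user_id):
        rows = self.db.execute(
            "SELECT key, value FROM memory WHERE user_id = ? ORDER BY key", (user_id,)
        )
        return dict(rows.fetchall())

def build_context(store, user_id, task):
    """Pull stored facts into the context at the start of a session."""
    facts = "\n".join(f"- {k}: {v}" for k, v in store.recall(user_id).items())
    return [
        {"role": "system", "content": f"Known facts about this user:\n{facts}"},
        {"role": "user", "content": task},
    ]

store = MemoryStore()
store.remember("u42", "preferred_language", "Rust")
store.remember("u42", "timezone", "UTC+9")
messages = build_context(store, "u42", "Scaffold a CLI tool.")
print(messages[0]["content"])
```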

Every engineering instinct you have was built on deterministic systems. Same input, same output. Stack traces that point at the line. Bugs that reproduce.

Conventional programs earn that determinism by construction: the same inputs walk the same branches to the same result, every run. That’s the contract the dark-room if statement gives you—you wrote the path, so you can predict it. An agent breaks the contract on purpose. You handed the control flow to a model, and the model is non-deterministic: identical inputs can produce different outputs across runs. This isn’t a bug you can fix. It’s the substrate. And it has consequences that ripple through your whole design:

Testing changes completely. You can’t assert output == expected when the output is phrased differently every time. A passing run doesn’t guarantee the next one passes. Your test suite is now a distribution, not a checkmark.

Retries are a real strategy. In deterministic code, retrying a failed operation with the same inputs gives the same failure, transient faults aside. In agent land, retrying can genuinely work, because the model rolls the dice again. Sometimes the fix is a loop with three attempts.
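A loop with three attempts might look like the sketch below. `run_with_retries` and `is_json_object` are hypothetical helpers, and the iterator of canned strings stands in for a model whose first sample happens to be malformed.

```python
import json

def run_with_retries(attempt_fn, is_valid, max_attempts=3):
    """Call attempt_fn until is_valid accepts its output or attempts run out."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        output = attempt_fn()
        if is_valid(output):
            return output, attempt
        failures.append(output)
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {failures!r}")

def is_json_object(text):
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

# Stand-in for a non-deterministic model: the first sample is malformed, the second is fine.
samples = iter(['{"status": "ok"', '{"status": "ok"}'])
output, attempts = run_with_retries(lambda: next(samples), is_json_object)
print(output, attempts)
```

The retry only makes sense because there is a validity check to retry against; without one you are just sampling until you get tired.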

Idempotency matters more. If the model might call your send_email tool twice—and across enough runs it will—that tool had better be safe to call twice. Design tools assuming they may fire more than once. The non-determinism that helps you on retries hurts you on side effects.
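One common way to make a side-effecting tool safe to call twice is to deduplicate on a key derived from the arguments. The sketch below keeps the keys in memory, which is an assumption for brevity: a real system would persist them, and would need a way to distinguish a deliberate resend from an accidental one. `IdempotentTool` and the fake mail function are hypothetical.

```python
import hashlib
import json

class IdempotentTool:
    """Wrap a side-effecting function so a repeated identical call is a no-op."""
    def __init__(self, func):
        self.func = func
        self._results = {}

    def __call__(self, **kwargs):
        # Identical arguments produce the same key, regardless of argument order.
        key = hashlib.sha256(json.dumps(kwargs, sort_keys=True).encode()).hexdigest()
        if key not in self._results:
            self._results[key] = self.func(**kwargs)
        return self._results[key]

outbox = []

def _send_email(to, subject, body):
    outbox.append((to, subject))   # stand-in for a real mail API
    return f"sent to {to}"

send_email = IdempotentTool(_send_email)
first = send_email(to="a@example.com", subject="Report", body="Attached.")
second = send_email(to="a@example.com", subject="Report", body="Attached.")  # duplicate call
print(len(outbox))
```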

Reproduction is painful. “It deleted the wrong file” might not reproduce on the next run. Log everything—full message history, every tool call, every argument—because the transcript is the only forensic trail you get. You cannot re-run your way back to the bug.
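A minimal sketch of "log everything": one JSON object per event, appended to a stream, so any run can be pulled back out by ID. The class name, field names, and example events are assumptions; an in-memory `io.StringIO` stands in for a real file or log sink.

```python
import io
import json
import time

class TranscriptLogger:
    """Append one JSON object per event so a run can be inspected after the fact."""
    def __init__(self, stream):
        self.stream = stream

    def log(self, run_id, kind, **payload):
        record = {"ts": time.time(), "run_id": run_id, "kind": kind, **payload}
        self.stream.write(json.dumps(record, default=str) + "\n")

def load_run(stream, run_id):
    """Read back every event recorded for one run, in order."""
    stream.seek(0)
    events = [json.loads(line) for line in stream if line.strip()]
    return [e for e in events if e["run_id"] == run_id]

log_file = io.StringIO()   # in production this would be a real file or log sink
logger = TranscriptLogger(log_file)
logger.log("run-1", "tool_call", name="delete_file", args={"path": "tmp/a.txt"})
logger.log("run-1", "tool_result", name="delete_file", result="ok")
logger.log("run-2", "tool_call", name="read_file", args={"path": "README"})
events = load_run(log_file, "run-1")
print([e["kind"] for e in events])
```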

This is the deepest adjustment for engineers. You’re no longer building a machine that does the same thing every time. You’re building a system that does a reasonable thing most of the time, and you engineer the guardrails so the unreasonable tail is survivable.

The 15-line loop works in a demo. Production is where the loop meets reality. The failure modes are specific and, once you know them, predictable.

The runaway loop:
  Model calls tool → gets confused by result → calls again →
  same confusion → calls again → ... → $400 in API costs

  Fix: hard cap on iterations. Always.

The infinite loop. Nothing in the basic loop says “stop after N steps.” A confused agent will cheerfully call tools forever. Always cap iterations. Always set a budget. This is the first guardrail you add and the one you’ll be gladdest you have.
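Here is roughly what the cap looks like once it is added to the loop. This sketch assumes a simplified model interface (a callable that returns a dict with optional `content`, `tool_call`, and `tokens` keys), which is invented for the example; `confused_model` simulates an agent stuck asking for the same tool.

```python
class BudgetExceeded(Exception):
    pass

def bounded_agent(task, tools, model, max_steps=10, max_tokens=50_000):
    """The basic loop plus two hard stops: a step cap and a token budget."""
    messages = [{"role": "user", "content": task}]
    tokens_used = 0
    for step in range(max_steps):
        reply = model(messages)              # hypothetical: returns a dict
        tokens_used += reply.get("tokens", 0)
        if tokens_used > max_tokens:
            raise BudgetExceeded(f"token budget exceeded at step {step + 1}")
        messages.append({"role": "assistant", "content": reply.get("content", "")})
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]
        result = tools[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": str(result)})
    raise BudgetExceeded(f"no final answer after {max_steps} steps")

def confused_model(messages):
    # Stand-in for a model stuck in a loop: it asks for the same tool forever.
    return {"tokens": 100, "tool_call": {"name": "ping", "args": {}}}

calls = []

def ping():
    calls.append(1)
    return "timeout"

try:
    bounded_agent("check the server", {"ping": ping}, confused_model, max_steps=5)
    outcome = "finished"
except BudgetExceeded as exc:
    outcome = str(exc)
print(outcome)
```

Replacing `while True` with a bounded `for` is the entire change. The exception gives the caller a clear signal to log the transcript and stop spending.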

Hallucinated tool arguments. The model calls a real tool with invented parameters—a file path that doesn’t exist, an ID it made up. Validate arguments before executing, and return a useful error so the model can recover on the next turn rather than charging ahead.
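A sketch of validating before executing, under the assumption of a very small hand-rolled schema (argument name mapped to a Python type). `validate_args`, `safe_read_file`, and the schema shape are hypothetical; a production system would more likely use a JSON Schema or a validation library.

```python
import os

def validate_args(schema: dict, args: dict) -> list:
    """Return a list of problems; an empty list means the call is safe to run."""
    problems = []
    for name, spec in schema.items():
        if name not in args:
            problems.append(f"missing required argument '{name}'")
        elif not isinstance(args[name], spec["type"]):
            problems.append(f"'{name}' must be {spec['type'].__name__}")
    for name in args:
        if name not in schema:
            problems.append(f"unknown argument '{name}'; expected {sorted(schema)}")
    return problems

READ_FILE_SCHEMA = {"path": {"type": str}}

def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def safe_read_file(args: dict) -> str:
    problems = validate_args(READ_FILE_SCHEMA, args)
    if not problems and not os.path.isfile(args["path"]):
        problems.append(f"file '{args['path']}' does not exist; list the directory first")
    if problems:
        return "Error: " + "; ".join(problems)   # goes back to the model as a tool result
    return read_file(**args)

missing = safe_read_file({"path": "definitely/not/a/real/file.txt"})
wrong = safe_read_file({"file": "x"})
print(missing)
print(wrong)
```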

Context exhaustion mid-task. The agent fills its window halfway through a long job and loses the thread. This is the compaction/retrieval problem above, and it’s why you design for it before you need it, not after the agent face-plants.

Cascading errors. One bad tool result poisons every subsequent decision. The agent builds an entire plan on a misread, and each step compounds the original mistake. Early validation and checkpoints contain the blast radius.

Silent wrongness. The worst one. The agent confidently does the wrong thing and reports success. No error, no crash—just a plausible, incorrect result. This is why you don’t let an unsupervised agent touch anything destructive, and why evals exist.

You cannot unit-test an agent the normal way—the output varies every run. So you do the next best thing: you build an eval set and measure behavior across many cases, accepting a pass rate instead of a pass/fail.

Traditional test:
  assert add(2, 2) == 4        # deterministic, binary

Agent eval:
  Run agent on 50 tasks →
  Did it complete each? → Score
  Track pass rate over time → Catch regressions

An eval is a task plus a way to judge the result. The judge might be an exact check (“did the file get created?”), a rubric, or—increasingly—another model grading the output against criteria. You assemble a representative set of tasks, run the agent against all of them, and watch the aggregate score.
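A bare-bones eval harness, as a sketch: each case pairs a task with a check, every case is run several times, and the output is a pass rate. `EvalCase`, `run_evals`, and the toy agent (which is wrong one time in four by construction) are invented for illustration.

```python
import itertools
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    task: str
    check: Callable[[str], bool]   # the judge: exact check, rubric, or model grader

def run_evals(agent, cases, runs_per_case=5):
    """Run every case several times and report overall and per-case pass rates."""
    per_case = {}
    for case in cases:
        passes = sum(1 for _ in range(runs_per_case) if case.check(agent(case.task)))
        per_case[case.task] = passes / runs_per_case
    overall = sum(per_case.values()) / len(per_case)
    return overall, per_case

_flaky = itertools.cycle(["4", "4", "4", "5"])   # wrong one time in four

def toy_agent(task):
    return next(_flaky) if task == "2 + 2" else "Paris"

cases = [
    EvalCase("2 + 2", lambda out: out == "4"),
    EvalCase("capital of France", lambda out: out == "Paris"),
]
overall, per_case = run_evals(toy_agent, cases, runs_per_case=4)
print(overall, per_case)
```

Running each case more than once matters: a single pass on the flaky case would have hidden the 25% failure rate.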

This matters for two reasons. First, non-determinism means a single run proves little. Second, the thing you’re really testing is the system around the model: your tools, your context management, your prompts. When you change a tool description or tweak the system prompt, the eval set tells you whether you helped or quietly broke something three tasks over. Without evals, you’re shipping changes to a non-deterministic system on vibes. With them, you have a regression signal. It’s the closest thing to CI that agentic systems have, and it’s not optional past toy scale.

The agent loop will do what the model decides. So the engineering question is: what are you comfortable letting it decide unsupervised?

Permission tiers:

  Auto-approve:     read a file, run a query, search the web
                    (reversible, low blast radius)

  Ask first:        write a file, send a message, spend money
                    (side effects, hard to undo)

  Never autonomous: delete production data, deploy, wire money
                    (catastrophic, irreversible)

The pattern is to gate tools by reversibility. Reading is safe, so let it run. Writing has consequences, so ask first. Anything irreversible or catastrophic gets a human in the loop, full stop. This is just the principle of least privilege applied to a non-deterministic actor.
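The tiers translate fairly directly into a wrapper around tool execution. In this sketch the tool names, the `Tier` enum, and the `approve` callback (which would prompt a human in a real system) are all hypothetical. Unknown tools default to "ask first" as the cautious choice.

```python
from enum import Enum

class Tier(Enum):
    AUTO = "auto-approve"
    ASK = "ask first"
    NEVER = "never autonomous"

# Hypothetical tool names, assigned to tiers by reversibility.
TOOL_TIERS = {
    "read_file": Tier.AUTO,
    "write_file": Tier.ASK,
    "delete_production_data": Tier.NEVER,
}

def gated_call(name, func, args, approve):
    """Run a tool only if its tier allows it; unknown tools default to ASK."""
    tier = TOOL_TIERS.get(name, Tier.ASK)
    if tier is Tier.NEVER:
        return f"Error: '{name}' cannot be run by the agent; ask a human to do it."
    if tier is Tier.ASK and not approve(name, args):
        return f"Error: the user declined '{name}'. Propose an alternative."
    return func(**args)

def deny_all(name, args):   # stand-in for a human who says no to everything
    return False

r1 = gated_call("read_file", lambda path: f"contents of {path}", {"path": "a.txt"}, deny_all)
r2 = gated_call("write_file", lambda path, text: "written", {"path": "a.txt", "text": "hi"}, deny_all)
r3 = gated_call("delete_production_data", lambda: "gone", {}, lambda n, a: True)
print(r1, r2, r3, sep="\n")
```

Refusals come back as tool results with a suggested next step, so a denied action becomes something the model can plan around rather than a dead end.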

Two more guardrails worth building in from the start:

Sandboxing. If the agent executes code or commands, run it somewhere contained—a container, a VM, a scratch directory—not on the host that matters. Assume it will eventually do something you didn’t anticipate, and make that survivable.

Human-in-the-loop checkpoints. For consequential multi-step work, insert a “show me the plan before executing” gate. Cheap insurance against an agent confidently marching off a cliff. The check-in is far cheaper than the cleanup.

You just read the whole engine in fifteen lines, and understanding it is the point—but in production you usually won’t hand-roll it. There’s a worthwhile story in why.

Both major labs built a coding agent first. OpenAI shipped Codex; Anthropic shipped Claude Code. Each was a terminal-driven agent aimed squarely at software tasks—read files, run commands, edit code, check the result. And in building them, both teams ran into the same realization: the harness underneath wasn’t coding-specific. The loop, the tool execution, the context management, the permission gating—none of it cared that the domain was code. It was a general agent runtime that happened to be pointed at a codebase.

So they generalized it. The patterns from Codex and Claude Code got lifted out into agent SDKs: OpenAI’s Agents SDK and Anthropic’s Claude Agent SDK (renamed from the Claude Code SDK precisely to signal it’s no longer just about code). These hand you the primitives this post described, plus a few it didn’t cover, as first-class pieces: the loop, tool/function calling, multi-agent handoffs, sessions and state, guardrails, and tracing. You configure them instead of reimplementing them.

The progression:

  Coding agent          →    General SDK
  (Codex, Claude Code)       (Agents SDK, Claude Agent SDK)

  proving ground             the same loop, generalized

The direction of travel is the thing to notice: the coding agent was the proving ground, and the SDK is the generalization. The same instincts from agentic coding—delegate clearly, expose good tools, manage context, gate the dangerous actions—are exactly what these toolkits encode. They aren’t magic. They’re the fifteen-line loop with the hard parts (retries, streaming, context compaction, permissioning) already solved and battle-tested.

One caveat, since this is the fastest-moving corner of the field: the specifics—names, APIs, which SDK does what—will drift. The loop underneath won’t. That’s the whole reason to learn it from the bottom up. Once you understand the engine, an SDK is a convenience that saves you from rebuilding plumbing—not a black box you’re forced to trust.

The most important engineering judgment here is restraint. Agents are the answer to a specific shape of problem, and that shape is narrower than the hype suggests.

Don't build an agent when:

  The task is fixed and known    →  write a script
  One model call would do it      →  just call the model
  You need guaranteed output      →  agents are non-deterministic
  Latency is critical             →  loops are slow (many round-trips)
  Every step needs verification   →  the agent saves you nothing

If the steps are known in advance, you don’t need a model deciding them at runtime—you need a function. If a single prompt answers the question, the loop is pure overhead. Agents earn their complexity only when the path genuinely can’t be predicted ahead of time: when the next step truly depends on what the previous step returned, and you couldn’t have written the branch yourself.

The failure mode of the moment is agentifying things that should be three lines of Python. A for loop with a model call inside is not an agent, and it’s usually the better design. Reach for the agent loop when the control flow itself needs to be decided by the model—the dark room where you don’t know in advance whether the fix is the light, the blinds, or the curtains. Otherwise you’re paying latency, cost, and non-determinism for flexibility you don’t need.
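For contrast, here is the "for loop with a model call inside" shape. The code decides every step and the model only fills in each one, so there is nothing to cap, gate, or compact. `summarize_all` and `fake_model` are hypothetical; the fake model just echoes the first five words so the example runs offline.

```python
def summarize_all(documents, model):
    """Fixed pipeline: the code decides the steps; the model only fills in each one."""
    summaries = []
    for doc in documents:
        summaries.append(model(f"Summarize in one sentence:\n{doc}"))
    return summaries

def fake_model(prompt):
    # Stand-in for a single model call: echoes the first five words of the document.
    text = prompt.split("\n", 1)[1]
    return " ".join(text.split()[:5]) + "..."

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Agents are loops around a model call, mostly.",
]
summaries = summarize_all(docs, fake_model)
print(summaries)
```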

Here’s the architecture once the 15-line loop grows up:

        ┌─────────────────────────────────┐
        │         The Agent Loop          │
        │   (decide → act → observe)      │
        └─────────────────────────────────┘
                  │            │
        ┌─────────┘            └─────────┐
        ▼                                ▼
  ┌──────────┐                    ┌─────────────┐
  │  Tools   │                    │   Context   │
  │(functions│                    │  Management │
  │ + schema)│                    │(compact/RAG)│
  └──────────┘                    └─────────────┘
        │                                │
        ▼                                ▼
  ┌──────────┐                    ┌─────────────┐
  │Guardrails│                    │   Memory    │
  │(perms +  │                    │ (external   │
  │ sandbox) │                    │   store)    │
  └──────────┘                    └─────────────┘
                  │            │
                  ▼            ▼
            ┌─────────────────────┐
            │       Evals         │
            │ (measure behavior)  │
            └─────────────────────┘

Every box is something you already know how to build. Tools are APIs. Context management is caching and state. Guardrails are permissions and sandboxing. Memory is a datastore with retrieval. Evals are tests, reshaped for a probabilistic world. The agent loop is a state machine. And if you don’t want to assemble the boxes yourself, an SDK hands you most of them pre-wired.

The only genuinely new thing is the model in the middle, making non-deterministic decisions. Everything around it is software engineering—the same discipline you already practice, pointed at an unusual core.

Concept                   What it really is                       The hard part
────────────────────────  ──────────────────────────────────────  ───────────────────────────────────────────
Agent loop                A while loop around a model call        Knowing when to stop
Request/response vs loop  You own control flow vs the model does  Delegating the decision
Tools                     Functions with a schema                 Descriptions and error messages are prompts
Context window            Working memory (RAM)                    Fixed budget; manage what’s in it
Memory                    External store + retrieval              Don’t over-engineer; a DB often suffices
Non-determinism           The substrate, not a bug                Reshapes testing, retries, reproduction
Failure modes             Loops, hallucinated args, cascades      Silent wrongness is the worst
Evals                     Tests for probabilistic systems         Pass rate, not pass/fail
Guardrails                Least privilege + sandboxing            Gate by reversibility
SDKs                      The coding-agent loop, generalized      Learn the loop first; specifics drift
When to build             Only when control flow is unknown       Restraint; most tasks aren’t agents

The mental model that makes all of this tractable:

  1. An agent is a loop: decide, act, observe, repeat
  2. It’s not request/response—you delegate control flow to the model
  3. Tools are functions you expose with good descriptions
  4. The context window is finite working memory you must curate
  5. You traded determinism for judgment, and that reshapes everything downstream
  6. You test with evals, not assertions
  7. You contain risk with permissions and sandboxes
  8. You don’t have to write the loop yourself—but you should understand it before reaching for an SDK
  9. You don’t build an agent at all unless the path is genuinely unknown

Agent building isn’t a new discipline you have to learn from zero. It’s your existing discipline—API design, state management, error handling, testing, least privilege—rearranged around a probabilistic core. The engineers who build good agents aren’t the ones who memorized a framework. They’re the ones who understood the loop, respected the non-determinism, and engineered the guardrails that make a fallible model safe to delegate to.

It really is mostly a while loop. The craft is in everything you wrap around it.