Asher Cohen

Building Token-Efficient AI Agents in the Real World

Why most agent systems overspend tokens, and a practical pipeline template to make them cheaper, faster, and more reliable.

I hit a wall with AI agents the way most engineers do: not because they couldn’t solve problems, but because they were too expensive to run. Tokens disappeared quickly—especially with workflows involving refactoring, testing, dependency upgrades, and orchestrated agents.

At first, the instinct was to optimize prompts, tweak models, or add more structure. None of that moved the needle meaningfully. The real issue wasn’t intelligence or capability—it was how work was structured.

The solution wasn’t making agents smarter. It was making them disciplined.


The core problem

Most agent systems fail for the same reason:

«They optimize for autonomy instead of efficiency.»

A typical setup includes multiple specialized agents (coder, tester, auditor, upgrader), orchestrators delegating tasks between them, recursive calls, and shared or repeated context. This mirrors how human teams operate, so it feels intuitive to replicate.

But unlike humans, LLMs don’t retain memory between calls. Every interaction is stateless. That means every agent must reconstruct context from scratch—re-reading files, reinterpreting goals, and re-deriving decisions.

What looks like “collaboration” is actually repeated recomputation.

In practice, this turns a linear problem into a multiplicative one. Instead of doing work once, the system does it several times across different agents. The illusion of parallel intelligence hides a very real cost: exponential token growth.


Where tokens actually get burned

Three patterns consistently dominate token usage.

1. Context duplication

Each agent independently reprocesses the same information: files, goals, constraints, and history. Even if nothing changes, that context is resent and reinterpreted.

For example, if your orchestrator passes a 3k-token codebase summary to a coder, tester, and auditor, you’ve already spent ~9k tokens before any meaningful work happens. If those agents loop or retry, that number compounds quickly.

This is particularly wasteful because most of that context is irrelevant to each specific task. A test-writing agent doesn’t need the full implementation—just the interface and behavior.

The insight here is simple: context should be scoped to the task, not the system.
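That scoping rule can be sketched as a small function. This is a hypothetical illustration (the `TaskKind` and `ProjectContext` names are invented for this sketch): each task type gets only the slice of context it needs, never the whole system.

```typescript
type TaskKind = "test" | "implement" | "audit";

interface ProjectContext {
  interfaces: string;      // public signatures and contracts only
  implementation: string;  // full source
  history: string;         // prior decisions and notes
}

// Return only the slice of context a given task actually needs.
function scopeContext(kind: TaskKind, ctx: ProjectContext): string {
  switch (kind) {
    case "test":
      // a test-writing step needs the interface and behavior, not the implementation
      return ctx.interfaces;
    case "implement":
      return ctx.interfaces + "\n" + ctx.implementation;
    case "audit":
      return ctx.history + "\n" + ctx.interfaces;
  }
}
```

The point is not the exact slicing policy; it is that the policy exists at all, instead of every agent receiving everything by default.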


2. Recursive delegation

When agents call other agents, you introduce multiplicative complexity. A single task can expand into multiple sub-tasks, each spawning further calls.

Consider a dependency upgrade:

  • orchestrator calls upgrader
  • upgrader calls tester
  • tester fails → calls analyzer
  • analyzer suggests fix → calls coder
  • coder modifies → calls tester again

You’ve now created a loop where each step reloads context and re-evaluates the problem. Even if each call is “small,” the total cost grows rapidly.
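The cost of that chain is easy to underestimate, because each hop re-sends the shared context before producing any output. A rough cost model (the numbers are illustrative, not measurements):

```typescript
// Each hop in a delegation chain re-pays for the shared context
// before producing its own output.
function chainCost(contextTokens: number, outputTokens: number, hops: number): number {
  return hops * (contextTokens + outputTokens);
}

// Five hops (orchestrator → upgrader → tester → analyzer → coder) at
// 3k context + 500 output each: ~17.5k tokens for a single task.
const example = chainCost(3000, 500, 5); // 17500
```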

This is the most dangerous pattern because it’s often invisible. The system appears to be “working hard,” but it’s mostly revisiting the same decisions in slightly different forms.


3. Multi-purpose agents

Agents that analyze, decide, execute, and validate are effectively running internal workflows. They are not atomic—they are compressed pipelines.

For instance, a “coder” agent that:

  • reads code
  • identifies issues
  • decides on a fix
  • implements changes
  • validates output

…is performing multiple reasoning steps in one call. This increases both input size (more context needed) and output size (more explanations and intermediate reasoning).

Worse, if the result is unsatisfactory, the entire process repeats, duplicating all internal steps again.

The key issue is that multi-purpose agents hide loops inside a single call, making them harder to control and optimize.


The shift: from “agent systems” to “execution pipelines”

The key shift is conceptual:

«Don’t build teams of agents. Build a pipeline.»

Instead of modeling collaboration, model execution. Pipelines enforce order, reduce ambiguity, and eliminate redundant reasoning.

In a pipeline:

  • each step has a clear responsibility
  • outputs are well-defined
  • transitions are explicit

This mirrors how compilers or CI pipelines work. There’s no ambiguity about what happens next, and no component reinterprets the entire system unless necessary.

This shift alone dramatically reduces token usage because it removes unnecessary decision points.


Minimal architecture that works

Instead of:

orchestrator → subagents → loops

Use:

planner → executor → validator

Planner

The planner interprets the goal and produces a structured task list. It operates at a high level and avoids implementation details.

For example, given “upgrade React and fix breaking changes,” the planner might output:

  • update package.json
  • run install
  • fix deprecated API usage in components
  • update tests

This step is cheap because it deals in abstractions, not code.


Executor

The executor performs one task at a time. Its input is tightly scoped, and its output is minimal—typically a diff.

For example:

- useEffect(() => { ... }, [])
+ useEffect(() => { ... }, [dependency])

There is no explanation, no re-analysis—just execution. This keeps token usage low and predictable.


Validator

The validator checks correctness using tests, linting, or rules. It does not attempt fixes.

For example:

Tests: FAIL
- Component X breaks due to missing prop

This separation ensures that validation does not trigger new reasoning loops inside the same step.


What about subagents and skills?

Subagents are often misunderstood.

What they are actually for

They exist for capability routing, tool binding, and context isolation. For example, a “test” subagent might have access to a test runner, while an “upgrade” subagent can run package managers.

This allows systems to constrain behavior and reduce risk.

What they are NOT

They are not independent thinkers or recursive systems. When used this way, they amplify the very problems they were meant to solve.


Skills: the right abstraction

Skills are the practical way to structure capabilities.

A good skill behaves like a function:

  • clear input
  • predictable output
  • no side effects beyond its scope

Skill: upgrade_dependency

Input:

  • package name
  • version

Output:

  • diff
  • short summary (<=100 tokens)

This makes skills composable and efficient. Instead of asking an agent to “figure things out,” you tell it exactly what to do within strict boundaries.
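The `upgrade_dependency` skill above can be sketched as a typed function. This is a hypothetical shape, not a real implementation — in practice the body would shell out to a package manager, and here it just returns a deterministic placeholder diff:

```typescript
interface UpgradeInput {
  packageName: string;
  version: string;
}

interface UpgradeOutput {
  diff: string;
  summary: string; // kept short, per the <=100-token contract
}

// Hypothetical skill: clear input, predictable output, no side effects.
function upgradeDependency(input: UpgradeInput): UpgradeOutput {
  const diff = [
    `- "${input.packageName}": "^old"`,
    `+ "${input.packageName}": "${input.version}"`,
  ].join("\n");
  return { diff, summary: `Bumped ${input.packageName} to ${input.version}` };
}
```

Because the contract is a function signature, callers can compose skills without knowing how any of them work internally.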


What broke in my original setup

The initial system had:

  • granular agents (coder, tester, auditor)
  • orchestrators coordinating them
  • agents performing multiple steps internally

This created overlapping responsibilities. Multiple agents would analyze the same problem from scratch, often reaching similar conclusions independently.

A single feature task could trigger:

  • planning
  • coding
  • testing
  • auditing
  • re-coding

Each step reintroduced context and reasoning, leading to rapid token growth.


The fix

1. Collapse roles to single-purpose units

Breaking agents into strict roles removes ambiguity. An implementer writes code. A validator checks it. A planner decides what to do.

This reduces the cognitive load on each step and prevents hidden loops.


2. Eliminate agent-to-agent calls

Direct delegation between agents introduces recursion. Instead, all coordination should happen at a single level, with explicit sequencing.

This ensures that each step is visible, measurable, and controllable.


3. Replace context with artifacts

Passing full transcripts or files is expensive. Instead, pass:

  • diffs (what changed)
  • test results (what failed)
  • compressed summaries (what matters)

For example, instead of passing a full file, pass:

Changed function: login()
Issue: missing null check

This keeps context small and relevant.
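One way to make that concrete is a helper that compresses a test run into an artifact like the one above. A minimal sketch (the `TestFailure` shape is an assumption for illustration):

```typescript
interface TestFailure {
  location: string; // e.g. a component or function name
  reason: string;   // e.g. "missing null check"
}

// Compress a test run into a compact artifact instead of a transcript.
function toTestArtifact(failures: TestFailure[]): string {
  if (failures.length === 0) return "Tests: PASS";
  const lines = failures.map(f => `- ${f.location}: ${f.reason}`);
  return ["Tests: FAIL", ...lines].join("\n");
}
```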


4. Enforce hard limits

Without constraints, agents default to exploration. They read more files, try more approaches, and loop longer.

Limits such as:

  • max 3 iterations
  • max 2 files per step
  • stop if no progress

…force the system to converge quickly.
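Limits like these only work if they are checked mechanically, not left to the model's judgment. A sketch of the stopping rule (the `progressMade` signal is hypothetical — in practice it might come from comparing artifacts between iterations):

```typescript
interface Limits {
  maxIterations: number;   // e.g. 3
  maxFilesPerStep: number; // e.g. 2
}

// Decide whether the loop must stop, regardless of what the model "wants".
function shouldStop(iteration: number, progressMade: boolean, limits: Limits): boolean {
  if (iteration >= limits.maxIterations) return true; // hard cap
  if (!progressMade) return true;                     // stop if no progress
  return false;
}
```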


5. Separate thinking from doing

Planning and execution should not happen together. A planning step defines the path. Execution follows without rethinking.

This prevents redundant reasoning and reduces token usage significantly.


Token math (why this matters)

A naive setup:

(4k context + 1k output) × 6 iterations = ~30k tokens

Optimized:

(1k state + 0.5k output) × 3 steps = ~4.5k tokens

The reduction comes from:

  • smaller context
  • fewer iterations
  • no duplication
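The arithmetic above fits in a one-line cost model (the token counts are illustrative, matching the figures in the text):

```typescript
// Per-run cost: (context + output) tokens, multiplied by the number of steps.
const runCost = (contextTokens: number, outputTokens: number, steps: number): number =>
  (contextTokens + outputTokens) * steps;

const naive = runCost(4000, 1000, 6);    // 30000
const optimized = runCost(1000, 500, 3); // 4500
const savings = 1 - optimized / naive;   // 0.85, i.e. ~85% fewer tokens
```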

Trade-offs

The optimized approach reduces cost and increases predictability. However, it requires more upfront structure and discipline.

You lose some flexibility—agents won’t “figure things out” dynamically. But in practice, that flexibility often translates to inefficiency rather than better outcomes.


Failure modes (and fixes)

Over-compression

Too little context leads to incorrect assumptions. Include interfaces and contracts to preserve meaning without full implementations.

Planner mistakes

A bad plan propagates errors. Require planners to state assumptions and risks to make issues visible early.

Premature stopping

Strict limits can halt valid progress. Allow controlled continuation when necessary.

Over-fragmentation

Too many skills create overhead. Focus on a small set of reusable, well-defined capabilities.


When NOT to use agents

Agents are unnecessary for simple or well-defined tasks. If you know what to do, a single prompt is faster and cheaper.

Agents are most useful when:

  • tasks require decomposition
  • outcomes are uncertain
  • multiple steps are needed

Key insight

«Intelligence isn’t your bottleneck. Tokens are.»

Most systems fail not because they lack capability, but because they waste resources. Reducing loops, context, and ambiguity has a greater impact than improving reasoning.


Final takeaway

Effective agent systems are not about autonomy. They are about control:

  • controlled context
  • controlled execution
  • controlled iteration

Once those constraints are in place, everything else—cost, speed, reliability—improves as a consequence.

A Practical Template for Token-Efficient Agents

All of this theory only matters if it translates into something you can actually build.

What follows is a minimal, production-oriented template that implements the ideas above. It deliberately avoids anything clever. The goal is not flexibility—it’s predictability and cost control.


The shape of the system

At a high level, the system is just a pipeline:

Planner (1 call, cheap)
  → Task list

FOR each task:
  → Executor (1 call, diff only)
  → Validator (1 call, pass/fail)

Optional:
  → Retry loop (max 2)

There are no recursive agents. No delegation. No hidden loops.

Every task flows through the same three steps. That consistency is what keeps token usage under control.


State is explicit, not implied

One of the biggest mistakes in agent systems is relying on implicit conversation state. That works in chat, but breaks down quickly in multi-step workflows.

Instead, everything is made explicit and serializable:

type Task = {
  id: string
  type: "implement" | "test" | "upgrade" | "analyze"
  target: string
  description: string
}

type State = {
  goal: string
  tasks: Task[]
  currentTaskIndex: number
  artifacts: {
    diffs: string[]
    testResults: string[]
    notes: string[]
  }
  attempts: number
}

This does two things:

  1. It removes the need to pass full conversational history
  2. It forces you to think in terms of artifacts, not transcripts

Instead of “what did we talk about?”, the system asks:

“What do we know, and what changed?”


The planner: think once, cheaply

The planner is the only place where real “thinking” happens. Its job is to take a goal and turn it into a small, concrete set of tasks.

Goal:

"Upgrade Next.js to the latest stable version, keep tests green, and document breaking changes."

Planner output (strict JSON, no prose):

[
  {
    "id": "t1",
    "type": "upgrade",
    "target": "package.json",
    "description": "Bump next and related peer deps to latest stable versions."
  },
  {
    "id": "t2",
    "type": "implement",
    "target": "src/",
    "description": "Apply required code changes for API or behavior differences."
  },
  {
    "id": "t3",
    "type": "test",
    "target": "ci",
    "description": "Run lint and tests, capture failures, and report status."
  },
  {
    "id": "t4",
    "type": "analyze",
    "target": "docs/changelog.md",
    "description": "Summarize migration impact and update release notes."
  }
]

Important constraints:

  • cap tasks at a small number (usually 3-7)
  • force atomic descriptions
  • reject vague tasks like “improve quality”

If planning is expensive, everything after it gets expensive too. Keep this step small and deterministic.
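Those constraints are enforceable with a plain validation function, no schema library required. A minimal sketch — the vague-task blocklist and error messages are assumptions, not a fixed spec:

```typescript
type TaskType = "implement" | "test" | "upgrade" | "analyze";

interface PlannedTask {
  id: string;
  type: TaskType;
  target: string;
  description: string;
}

const ALLOWED_TYPES = new Set(["implement", "test", "upgrade", "analyze"]);
const VAGUE = /improve quality|make (it )?better|clean ?up/i; // hypothetical blocklist

// Reject planner output that is not a small list of atomic, concrete tasks.
function validatePlan(raw: string, maxTasks = 7): PlannedTask[] {
  const parsed = JSON.parse(raw); // throws on non-JSON, which is the point
  if (!Array.isArray(parsed)) throw new Error("plan must be a JSON array");
  if (parsed.length < 1 || parsed.length > maxTasks) throw new Error("plan size out of bounds");
  for (const t of parsed) {
    if (typeof t.id !== "string" || typeof t.target !== "string" ||
        typeof t.description !== "string" || !ALLOWED_TYPES.has(t.type)) {
      throw new Error("task is missing required fields");
    }
    if (VAGUE.test(t.description)) throw new Error(`vague task rejected: ${t.id}`);
  }
  return parsed as PlannedTask[];
}
```

Failing fast here is cheap; a malformed plan that reaches the executor is not.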


The executor: context is scoped per task

The executor receives only:

  • the current task
  • the relevant diff or files
  • a compact summary of prior artifacts

It does not receive the full transcript or unrelated files.

This is where most token waste is eliminated. You pay for exactly the context needed to do one unit of work.


The validator: binary, ruthless, fast

The validator should return a constrained result:

type ValidationResult = {
  status: "pass" | "fail"
  reason: string
}

No long essays. No speculative advice. Just enough information to decide whether to proceed or retry.

This keeps downstream branching simple:

  • pass → commit artifact and move to next task
  • fail → retry with tightened instructions (up to max attempts)

Retry loop: bounded recovery, not open-ended exploration

Retries are useful, but only when bounded.

A practical rule:

  • max attempts per task: 3 (1 initial run + up to 2 retries)
  • each retry must include a delta instruction (what to change)
  • if still failing, escalate to human review

Unbounded retries turn small failures into massive token burns.
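The delta-instruction rule can be sketched as a prompt builder. `buildRetryPrompt` is a hypothetical helper, not part of any library — the shape just shows that each retry carries the last failure and a narrow instruction, and that exceeding the cap escalates instead of looping:

```typescript
// Build a tightened retry prompt from the last failure.
function buildRetryPrompt(
  taskDescription: string,
  lastFailure: string,
  attempt: number,
  maxAttempts = 3,
): string {
  if (attempt > maxAttempts) {
    throw new Error("max attempts exceeded: escalate to human review");
  }
  return [
    `Task: ${taskDescription}`,
    `Attempt ${attempt} of ${maxAttempts}.`,
    `Previous attempt failed: ${lastFailure}`,
    "Change only what is needed to fix the failure above. Output a diff.",
  ].join("\n");
}
```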


Minimal orchestration pseudocode

for (const task of state.tasks) {
  const maxAttempts = 3 // 1 initial run + up to 2 retries
  let attempts = 0
  let done = false

  while (!done && attempts < maxAttempts) {
    attempts += 1
    // The executor sees only the current task and prior artifacts, never a transcript
    const execution = await runExecutor(task, state.artifacts)
    const validation = await runValidator(task, execution)

    if (validation.status === "pass") {
      state.artifacts.diffs.push(execution.diff)
      state.artifacts.notes.push(validation.reason)
      done = true
    } else {
      // Record the failure as an artifact; the next attempt reads it as its delta instruction
      state.attempts += 1
      state.artifacts.notes.push(`retry:${task.id}:${validation.reason}`)
    }
  }

  // No silent fallback: a task that exhausts its attempts escalates
  if (!done) throw new Error(`Task failed: ${task.id}`)
}

That’s it. No hidden planner re-entry. No implicit memory. No autonomous branching.


Production defaults that keep costs sane

  • hard token budget per task
  • hard timeout per model call
  • strict schema validation for planner output
  • deterministic prompts (versioned, tested, short)
  • artifact logging for replay and audit

When budgets are explicit, failures become debuggable instead of mysterious.
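A hard token budget, for example, can be a small guard object rather than a policy buried in a prompt. A sketch with illustrative names and numbers:

```typescript
// An explicit per-task token budget: fail loudly the moment it is exceeded.
class TokenBudget {
  private used = 0;
  constructor(private readonly limit: number) {}

  // Record spend; throw as soon as the hard budget is exceeded.
  spend(tokens: number): void {
    this.used += tokens;
    if (this.used > this.limit) {
      throw new Error(`token budget exceeded: ${this.used}/${this.limit}`);
    }
  }

  remaining(): number {
    return this.limit - this.used;
  }
}
```

The error message carries the exact overspend, which is what makes the failure debuggable rather than mysterious.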


Final point

If you want token-efficient agents, don’t start with smarter prompts. Start with stricter system design.

A small, boring, explicit pipeline will beat a “clever” autonomous setup almost every time—on cost, reliability, and speed.