
Is Your AI Agent Producing Garbage? The Problem Lies in Your Reluctance to “Burn” Tokens
TechFlow Selected

The problem isn’t with the prompt!
Author: Systematic Long Short
Translation & Editing: TechFlow
TechFlow Intro: The core argument of this article can be summed up in one sentence: the output quality of an AI Agent is directly proportional to the number of tokens you invest.
The author isn’t speaking abstractly or theoretically—instead, they present two concrete, immediately actionable methods—and clearly define the boundary beyond which throwing more tokens won’t help: the “novelty problem.”
For readers currently using Agents to write code or run workflows, this piece delivers high information density and strong practical utility.
Introduction
Okay, you have to admit—the headline is attention-grabbing. But seriously, this isn’t a joke.
In 2023, when we were already deploying production-grade code generated by LLMs, people around us were stunned—because the prevailing belief at the time was that LLMs could only produce unusable garbage. Yet we knew something others hadn’t realized: an Agent’s output quality is a function of the number of tokens you invest. That’s it—simple as that.
You can verify this yourself with a few quick experiments. Ask your Agent to complete a complex, relatively obscure programming task—for example, implementing from scratch a constrained convex optimization algorithm. First, run it at the lowest reasoning setting; then switch to the highest reasoning setting and have it review its own code—how many bugs does it catch? Try medium and high settings too. You’ll observe intuitively: the number of bugs decreases monotonically as token investment increases.
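The sweep described above can be sketched as a tiny harness. Everything here is hypothetical scaffolding: `generate` and `count_bugs` stand in for your own agent call and your own bug-counting review pass; only the sweep structure is the point.

```python
# Hedged sketch: `generate(task, effort)` and `count_bugs(code)` are
# hypothetical stand-ins for your agent call and your review step.
EFFORTS = ["minimal", "medium", "high"]

def effort_sweep(task, generate, count_bugs):
    """Run the same task at each reasoning-effort level and tally bugs left."""
    return {effort: count_bugs(generate(task, effort)) for effort in EFFORTS}

# Toy deterministic stand-ins, just to show the shape of the result.
fake_outputs = {
    "minimal": "x = 1\n# BUG\n# BUG",
    "medium": "x = 1\n# BUG",
    "high": "x = 1",
}
results = effort_sweep(
    "constrained convex optimization",
    lambda task, effort: fake_outputs[effort],
    lambda code: code.count("BUG"),
)
print(results)  # {'minimal': 2, 'medium': 1, 'high': 0}
```

In a real run you would plot remaining bugs against effort level and watch the curve fall as the token spend rises.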
This isn’t hard to understand, right?
More tokens = fewer errors. You can extend this logic further—it’s essentially the (simplified) core idea behind code-review products. Switch to a completely fresh context and invest massive tokens—for instance, instructing the Agent to parse code line-by-line and assess whether each line contains a bug. This approach catches the vast majority—or even all—bugs. You can repeat this process ten times, a hundred times, each time examining the codebase “from a different angle,” and eventually uncover every bug.
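A minimal sketch of that repeated fresh-context review, assuming `review_fn` is a stand-in for one independent LLM review pass over the code:

```python
import random

def fresh_context_review(code_lines, review_fn, passes=10, seed=0):
    """Union the findings of `passes` independent review passes.

    `review_fn` is a hypothetical stand-in for an LLM asked, in a
    completely fresh context, to flag suspicious line numbers.
    """
    rng = random.Random(seed)
    flagged = set()
    for _ in range(passes):
        flagged |= set(review_fn(code_lines, rng))
    return sorted(flagged)

# Toy reviewer: each pass independently catches each real bug 60% of the
# time, so repeated passes drive the miss rate toward zero (0.4 ** passes).
REAL_BUGS = {2, 5}

def toy_reviewer(code_lines, rng):
    return [i for i in sorted(REAL_BUGS) if rng.random() < 0.6]

code = ["a = 1", "b = 2", "c = a + bb", "d = 3", "e = 4", "f = d / 0"]
print(fresh_context_review(code, toy_reviewer))  # [2, 5]
```

The union over passes is the whole trick: each pass is cheap and imperfect, but misses become exponentially unlikely as passes accumulate.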
There’s also empirical support for the claim that “burning more tokens improves Agent quality”: teams claiming their Agents can write production-ready code end-to-end are either foundational model providers themselves—or extremely well-funded companies.
So if you’re still struggling with Agents failing to produce production-grade code—let’s be blunt—the issue lies with you. Or rather, with your wallet.
How to Tell Whether You’re Burning Enough Tokens
I’ve written an entire article arguing that the problem absolutely does not lie in your framework (“harness”)—“keeping it simple” can still yield excellent results—and I still firmly believe that. You read that piece, followed its advice—but remain deeply disappointed by your Agent’s output. You DM’d me; I read it but didn’t reply.
This article is my reply.
Your Agent performs poorly and fails to solve problems—most of the time, simply because you’re not burning enough tokens.
How many tokens a problem requires depends entirely on its scale, complexity, and novelty.
“What is 2 + 2?” requires almost no tokens.
“Build me a bot that scans all markets across Polymarket and Kalshi, identifies semantically similar markets that should settle before/after the same event, sets no-arbitrage bounds, and executes low-latency automated trades whenever arbitrage opportunities arise”—that demands a huge token burn.
We’ve observed something interesting in practice.
If you invest sufficient tokens to address problems arising from scale and complexity, your Agent will *always* succeed. In other words, if you’re building something extremely complex—with many components and thousands of lines of code—throwing enough tokens at those problems will ultimately resolve them completely.
There is, however, one small—but critical—exception.
Your problem must not be too novel. At this stage, no amount of tokens can solve the “novelty problem.” Sufficient tokens can reduce errors caused by complexity to zero—but they cannot make an Agent spontaneously invent something it has never encountered before.
That conclusion, frankly, was a relief.
We invested enormous effort—and burned *a lot*, *a lot*, *a lot* of tokens—trying to see whether an Agent could reconstruct institutional investment processes with minimal guidance. Partly, we wanted to gauge how many years remain before quantitative researchers like us are fully replaced by AI. The result? The Agent couldn’t even approximate a credible institutional investment process. We believe this is largely because such processes simply don’t exist in its training data.
So—if your problem is novel, don’t expect stacking tokens to solve it. You need to guide the exploration yourself. But once you’ve settled on an implementation strategy, feel free to stack tokens freely during execution—no matter how large the codebase or how intricate the components.
Here’s a simple heuristic: your token budget should scale proportionally with lines of code.
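As a sketch of that heuristic (the constants here are pure assumptions, not recommendations; calibrate them against your own projects):

```python
def token_budget(lines_of_code, tokens_per_line=200, floor=50_000):
    """Scale the build-and-review budget linearly with codebase size.

    `tokens_per_line` and `floor` are illustrative assumptions only;
    tune them on your own projects.
    """
    return max(floor, lines_of_code * tokens_per_line)

print(token_budget(10_000))  # 2000000
print(token_budget(50))      # 50000 (small tasks still get a minimum budget)
```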
What Extra Tokens Actually Do
In practice, additional tokens improve Agent engineering quality in several key ways:
They allow the Agent more time to reason within a single attempt—giving it a chance to spot flawed logic itself. Deeper reasoning = better planning = higher probability of hitting the target on the first try.
They enable multiple independent attempts, exploring different solution paths. Some paths are objectively better than others. Allowing more than one attempt lets the Agent select the optimal one.
Likewise, more independent planning attempts let it discard weak directions and retain only the most promising ones.
More tokens permit the Agent to critique its prior work using a completely fresh context—giving it a chance to improve, rather than getting stuck in “reasoning inertia.”
And finally—my personal favorite—more tokens mean it can use tests and tools to validate. Actually running the code to see whether it works remains the most reliable way to confirm correctness.
This logic holds because Agent engineering failures are rarely random. They almost always stem from choosing the wrong path too early, failing to verify whether that path is viable (early on), or lacking sufficient budget to recover and backtrack after detecting an error.
That’s the story. Tokens literally buy you decision quality. Think of it like research work: if you ask someone to answer a difficult question under tight time pressure, answer quality drops.
Research, fundamentally, produces one thing: knowing the answer. Humans expend biological time to generate better answers; Agents expend computational time to do the same.
How to Improve Your Agent
You may still be skeptical—but numerous papers support this view. Honestly, the mere existence of a “reasoning” tuning knob is proof enough.
One paper I especially admire trained a model on a small, carefully curated set of reasoning examples—and then forced it to continue thinking even when it wanted to stop. Their method? Simply appending the word “Wait” at points where the model tried to halt. That single intervention lifted performance on a benchmark from 50% to 57%.
Let me be as direct as possible: if you’ve been complaining that your Agent writes mediocre code, then even the highest single-try reasoning setting is likely still insufficient for you.
I offer you two very simple solutions.
Simple Fix #1: WAIT
The simplest thing you can do today: build an automatic loop—after initial construction, have the Agent review its output N times in fresh contexts, fixing issues each time.
If this simple trick improves your Agent’s engineering outcomes, then you’ve confirmed your problem is purely one of token volume—welcome to the Token-Burning Club.
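The loop itself is a few lines. Here `review_fn` and `fix_fn` are hypothetical stand-ins for fresh-context agent calls; the toy stand-ins below just demonstrate the control flow:

```python
def review_fix_loop(artifact, review_fn, fix_fn, max_rounds=5):
    """After the initial build, review in a fresh context and fix,
    repeatedly, until a pass comes back clean or the budget runs out."""
    for rounds in range(max_rounds):
        issues = review_fn(artifact)  # fresh-context review (stand-in)
        if not issues:
            return artifact, rounds
        artifact = fix_fn(artifact, issues)
    return artifact, max_rounds

# Toy stand-ins: "BUG" lines are issues; each round repairs only the first
# one found, mimicking an agent that fixes issues incrementally.
find = lambda lines: [i for i, l in enumerate(lines) if l == "BUG"]
fix = lambda lines, issues: [
    "ok" if i == issues[0] else l for i, l in enumerate(lines)
]

final, rounds = review_fix_loop(["ok", "BUG", "ok", "BUG"], find, fix)
print(final, rounds)  # ['ok', 'ok', 'ok', 'ok'] 2
```

The key detail is that each review round starts from a clean context, so the reviewer is not anchored to the builder's reasoning.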
Simple Fix #2: VERIFY
Get your Agent to verify its own work early and often. Write tests to prove the chosen path actually works. This is especially valuable for highly complex, deeply nested projects—a function might be called by dozens of downstream functions. Catching errors upstream saves massive amounts of downstream compute (tokens). So wherever possible, sprinkle “verification checkpoints” throughout the build process.
Agent finishes writing a section and declares it done? Bring in a second Agent to verify it. Independent reasoning streams help cancel out the systematic biases any single Agent carries.
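One way to sketch those verification checkpoints: pair every build step with a verifier and fail fast, so a broken upstream component never burns downstream tokens. The step names and checks below are toy assumptions standing in for real build and test calls.

```python
def build_with_checkpoints(steps):
    """Run (name, build, verify) steps in order; stop at the first failed
    checkpoint instead of letting the error propagate downstream."""
    state = {}
    for name, build, verify in steps:
        state[name] = build(state)
        if not verify(state):
            raise RuntimeError(f"checkpoint failed after step {name!r}")
    return state

# Toy pipeline: each verify runs immediately, before anything depends on
# the step's output.
steps = [
    ("parse", lambda s: [1, 2, 3], lambda s: len(s["parse"]) == 3),
    ("total", lambda s: sum(s["parse"]), lambda s: s["total"] == 6),
]
print(build_with_checkpoints(steps))  # {'parse': [1, 2, 3], 'total': 6}
```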
That’s basically it. I could write much more on this topic—but I’m convinced that recognizing these two points and executing them rigorously solves 95% of real-world problems. I firmly believe in mastering the simple things first—and adding complexity only as needed.
I mentioned “novelty” as the one problem tokens cannot solve—I want to reiterate that point, because you *will* hit this wall sooner or later—and then come crying to me that “stacking tokens doesn’t work.”
When your problem lies outside the training data, *you* are the one who must supply the solution. Domain expertise remains critically important.