
Do Claude and Codex Get Dumber the More You Use Them? Because Your Context Is Too Bloated
TechFlow Selected
From controlling context and handling AI’s tendency to please, to defining task termination conditions—this is the clearest explanation of Claude/Codex engineering practices I’ve seen so far.
Author: sysls
Compiled by TechFlow
TechFlow Intro: A developer blogger with 2.6 million followers, sysls, published a hands-on, long-form article that was shared 827 times and liked 7,000 times. Its core message is just one sentence: your plugins, memory systems, and various harnesses are likely doing more harm than good. This article avoids high-level theory entirely—it distills actionable principles from real production projects: how to control context, handle AI’s “pleasing bias,” and define precise task termination conditions. To date, it’s the clearest explanation available of Claude/Codex engineering practice.
Full text below:
Introduction
You’re a developer who uses Claude and Codex CLI daily—and every day you wonder whether you’re truly squeezing every drop of capability out of them. Occasionally, you witness them do something astonishingly stupid, leaving you baffled as to why some people seem to be building rockets with AI while you can’t even stack two stones stably.
You suspect it’s your harness, your plugins, or your terminal setup. You’ve tried beads, opencode, zep; your CLAUDE.md file spans 26,000 lines. Yet no matter how much you tinker, you still don’t understand why you’re drifting further from heaven while others frolic with angels.
This is the article you’ve been waiting for.
Also, I have no vested interest. When I refer to CLAUDE.md, I include AGENT.md; when I say “Claude,” I mean both Claude and Codex—I use both extensively.
Over the past few months, I’ve observed something interesting: almost nobody truly knows how to maximize agent capabilities.
It feels like a tiny fraction of users can build entire worlds with agents, while the rest drown in an ocean of tools—suffering from choice paralysis, believing that finding the right package, skill, or harness combination will unlock AGI.
Today, I want to dismantle all that—and leave you with one simple, honest sentence, then build from there. You don’t need the latest agent harness. You don’t need to install a million packages. And you certainly don’t need to read a million articles to stay competitive. In fact, your enthusiasm may well be doing more harm than good.
I’m not here for sightseeing—I’ve used agents since they could barely write code. I’ve tried every package, every harness, every paradigm. I’ve built signals, infrastructure, and data pipelines using agent factories—not “toy projects,” but real, production-deployed use cases. After all that…
Today, I do my best, most breakthrough work with an almost absurdly minimal configuration: just the basic CLI tools (Claude Code and Codex), plus a solid grasp of a few foundational principles of agent engineering.
Understanding That the World Is Moving at Breakneck Speed
First, let me state this plainly: foundational model companies are in the midst of a historic sprint—and clearly show no signs of slowing down. Each new leap in “agent intelligence” reshapes how you collaborate with them, because agents are increasingly designed to obey instructions faithfully.
Just a few generations ago, if you wrote in CLAUDE.md, “Read READTHISBEFOREDOINGANYTHING.md before doing anything,” there was a 50% chance the agent would reply, “Go to hell,” and proceed to do whatever it pleased. Today, it obeys most instructions—even complex nested ones. For example, “First read A, then read B; if C holds, read D.” Most of the time, it happily follows along.
What does this imply? The most important principle is recognizing that each new generation of agent forces you to rethink what constitutes the optimal solution—which is precisely why less is more.
When you rely on many different libraries and harnesses, you lock yourself into a “solution”—but that very problem may vanish entirely with the next-generation agent. Do you know who the most enthusiastic and heavy users of agents are? Yes—the employees at cutting-edge companies, who enjoy unlimited token budgets and access to the absolute latest models. Do you realize what that means?
It means that if a genuine problem exists—and has a good solution—those frontier companies will be the biggest adopters of that solution. What do they do next? They integrate that solution directly into their products. Think about it: why would a company allow another product to solve a real pain point and create external dependencies? How do I know this is true? Look at skills, memory harnesses, sub-agents—they all began as practical “solutions” to real problems, proven effective through real-world usage.
So if something is genuinely breakthrough and meaningfully expands agent use cases, it will inevitably be absorbed into the core offerings of foundational model companies. Trust me—they’re moving at lightning speed. So relax: you don’t need to install anything or depend on any external tooling to do your best work.
I predict the comments section will quickly fill with: "sysls, I used [X] harness—it's amazing! I rebuilt Google in a day!" To which I say: Congratulations! But you're not the target audience—you represent an extremely, extremely niche segment of the community: those who truly understand agent engineering.
Context Is Everything
Really. Context is everything. Another problem with using a thousand plugins and external dependencies is that you suffer severely from “context bloat”—your agent gets drowned in too much information.
Ask the agent to write a Python word-guessing game? Easy. Wait—what was that "manage memory" note from 26 conversations ago? Oh right, the user's screen froze 71 conversations ago because we spawned too many subprocesses. Always write notes? Sure, no problem… But what does any of that have to do with a word-guessing game?
You get the idea. You want to give the agent *exactly* the information needed to complete the task—and nothing more, nothing less! The better your control over this, the better the agent performs. Once you start introducing strange memory systems, plugins, or skills with confusing naming and invocation patterns, you’re handing the agent both a bomb-making manual and a cake-baking recipe—while all you asked for was a short poem about redwood forests.
So again, I preach: strip away all dependencies—and then…
Do Things That Actually Matter
Precisely Specify Implementation Details
Remember: context is everything?
Remember wanting to inject *exactly* the information the agent needs to complete the task—no more, no less?
The first way to achieve that is to separate research from implementation. Be ruthlessly precise about what you’re asking the agent to do.
What happens when you’re imprecise? “Build an authentication system.” Now the agent must research: What *is* an authentication system? What options exist? What are their trade-offs? It starts scouring the web for information it doesn’t actually need—cluttering its context with irrelevant implementation possibilities. By the time actual implementation begins, it’s more likely to get confused—or to hallucinate unnecessary, irrelevant details around the chosen approach.
Conversely, if you say, “Implement JWT authentication using bcrypt-12 password hashing, refresh token rotation, 7-day expiration…”, it doesn’t need to research alternatives. It knows exactly what you want—and can fill its context with concrete implementation details.
Of course, you won’t always know the implementation details. Often you don’t know what’s correct—or even want to delegate that decision to the agent. What then? Simple: create a dedicated research task to explore implementation options—then either decide yourself, or let the agent choose—and finally assign implementation to a *fresh* agent with a clean, newly constructed context.
Once you begin thinking this way, you’ll spot places across your workflow where the agent’s context is unnecessarily polluted—and you can erect isolation walls within that workflow, abstracting away irrelevant information and retaining only the specific context that enables peak performance on the task. Remember: you’re working with an exceptionally talented, intelligent teammate who knows everything about every kind of sphere in the universe—but unless you tell them you want to design a space where people dance and have fun, they’ll keep lecturing you on the virtues of spherical objects.
The Design Limitation of Pleasing Bias
No one wants to use a product that constantly criticizes them, tells them they’re wrong, or ignores their instructions outright. So these agents strive to agree with you—to do exactly what you ask.
If you ask it to insert the word “happy” after every three words, it will try hard to comply—most people understand this. Its obedience is precisely what makes it such a useful product. But this has a fascinating side effect: it means that if you say, “Help me find a bug in the codebase,” it *will* find a bug—even if it has to fabricate one. Why? Because it *desperately* wants to follow your instruction!
Most people quickly complain about LLMs hallucinating or inventing nonexistent things—yet fail to recognize the root cause lies in *themselves*. You ask for something, and it delivers—stretching facts if necessary!
So what do you do? I’ve found “neutral prompts” highly effective—prompts that don’t bias the agent toward any particular outcome. Instead of saying, “Help me find a bug in the codebase,” I say, “Scan the entire codebase, trace the logic flow through each component, and report back all findings.”
This neutral prompt sometimes uncovers bugs—but often simply describes objectively how the code behaves. Crucially, it doesn’t prime the agent with the presupposition that “a bug must exist.”
Another way to handle pleasing bias is to turn it into an advantage. I know the agent is trying hard to please me and follow my instructions—I can steer it deliberately in either direction.
So I deploy a “bug-finding agent” tasked with identifying *all* bugs in the codebase, assigning +1 point for low-impact bugs, +5 for medium-impact, and +10 for high-impact ones. I know this agent will enthusiastically identify every possible type of bug—including borderline or non-bugs—and report back a score like “104 points.” I treat this output as a superset of *all possible* bugs.
Then I deploy a “counter-agent” to refute each reported bug, awarding it the same point value for each successful refutation—but penalizing it with -2× the bug’s point value for each incorrect refutation. This agent works hard to refute as many bugs as possible—but the penalty mechanism keeps it cautious. It still actively “refutes” bugs—including real ones. I treat its output as a subset of *only the real* bugs.
Finally, I deploy a “judge agent” to synthesize inputs from both agents and assign scores. I tell the judge agent I hold the ground-truth answer: +1 point for each correct judgment, -1 for each incorrect one. It thus scores both the bug-finder and counter-agent on each reported “bug.” Whatever the judge declares as truth, I verify. Most of the time, this method yields surprisingly high fidelity—occasionally it errs, but it’s already close to error-free.
You might find a standalone bug-finding agent sufficient—but this method works exceptionally well for me because it leverages each agent’s innate programming: the desire to please.
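The three-agent loop above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions: it uses the Claude Code CLI's non-interactive `-p`/`--print` flag to spawn each fresh session, and the `run_agent` helper plus all prompt wording are my own illustration, not the author's exact prompts.

```python
import subprocess

def run_agent(prompt: str) -> str:
    """Run one fresh, non-interactive agent session and return its reply.

    Assumes the Claude Code CLI's -p/--print flag; substitute whatever
    CLI you actually use.
    """
    result = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    return result.stdout

def adversarial_bug_review(code_path: str, run=run_agent) -> str:
    # 1. Bug-finder: rewarded per reported bug, so it over-reports.
    #    Treat its output as a superset of all possible bugs.
    found = run(
        f"Review the code under {code_path} and list every bug you can find. "
        "Score yourself: +1 per low-impact bug, +5 medium, +10 high-impact."
    )
    # 2. Counter-agent: rewarded per refutation but penalized 2x the bug's
    #    value for a wrong refutation, so it under-reports. Treat the
    #    surviving bugs as a subset of the real ones.
    rebuttal = run(
        "Refute each reported bug below where you can. A correct refutation "
        "earns the bug's point value; an incorrect one costs double.\n\n"
        + found
    )
    # 3. Judge: told the user holds ground truth (+1 per correct ruling,
    #    -1 per wrong one), it rules on every reported bug.
    return run(
        "You are a judge; I hold the ground-truth answers and will score "
        "you +1 per correct ruling and -1 per incorrect one. For each bug "
        "below, rule REAL or NOT-REAL with one line of justification.\n\n"
        "Bug-finder said:\n" + found + "\n\nCounter-agent said:\n" + rebuttal
    )
```

Whatever the judge declares, you still verify by hand, exactly as the text says.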
How to Judge What’s Useful—and Worth Using
This question seems daunting—like it demands deep study and constant tracking of AI frontiers—but it’s actually simple… If both OpenAI and Anthropic have implemented it—or acquired the company that did—it’s almost certainly useful.
Have you noticed how “skills” are now ubiquitous—and part of Claude and Codex’s official documentation? Have you noticed OpenAI’s acquisition of OpenClaw? Have you noticed how Claude promptly added memory, voice, and remote-work capabilities afterward?
What about planning? Remember how many people discovered that planning first, then implementing, was genuinely helpful—and how it quickly became a core feature?
Yes—those are useful!
Remember how endless stop-hooks were super useful, because agents were extremely reluctant to perform long-running tasks—until Codex 5.2 dropped, and that need vanished overnight?
That’s all you need to know… If something is truly important and useful, Claude and Codex *will implement it themselves!* So you don’t need to worry excessively about adopting “new things” or mastering “new things”—you don’t even need to “keep up.”
Do me a favor: occasionally update your chosen CLI tool and skim the changelog for new features. That’s enough.
Compaction, Context, and Assumptions
Some users hit a massive pitfall when using agents: sometimes they seem like the smartest beings on Earth; other times, you can’t believe you’ve been fooled by them.
“Is this thing smart? It’s fucking dumb!”
The biggest difference lies in whether the agent is forced to make assumptions—or “fill in the blanks.” Today, they remain terrible at “connecting dots,” “filling gaps,” or making assumptions. As soon as they do, it becomes immediately obvious—and performance plummets.
One of the most critical rules in my CLAUDE.md governs how context is acquired—and it instructs the agent to read *that rule first* every time it reads CLAUDE.md (i.e., after every compaction). As part of this context-acquisition rule, a few simple instructions yield massive impact: re-read the task plan, and re-read all files relevant to the task, before proceeding.
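In practice this can be a short block at the top of CLAUDE.md; the file names here are illustrative, not the author's:

```markdown
## Context acquisition (read this rule first, every time you read CLAUDE.md)

1. Re-read the current task plan in `PLAN.md`.
2. Re-read every file that plan lists as relevant to the active task.
3. Only then continue working. Do not rely on a summary of earlier context.
```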
Telling the Agent How to End a Task
We humans have a clear sense of when a task is “done.” For agents, the current ceiling of intelligence is that they know how to *start* a task—but not how to *end* it.
This frequently leads to deeply frustrating outcomes: the agent implements a bunch of stubs and calls it quits.
Tests serve as excellent milestones for agents because they’re deterministic—you can set crystal-clear expectations. Unless these X tests pass, the task isn’t done—and you forbid modifying the tests.
Then you simply review the tests; once all pass, you can rest assured. You can even automate this—but the key point is: “task completion” feels natural to humans, but *not* to agents.
Do you know what else recently became a viable task endpoint? Screenshot + verification. You can instruct the agent to implement something until all tests pass—then take a screenshot and verify that the “design or behavior” shown matches expectations.
This lets the agent iterate and converge on your desired design—without stopping prematurely after the first attempt!
A natural extension is to co-create a formal “contract” with the agent—and embed it in your rules. For example, `{TASK}_CONTRACT.md` specifies what must be completed before the agent is allowed to terminate the session. Inside it, you specify the tests, screenshots, and other validations required before you certify the task as complete!
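Such a contract might look like this; the task name and checklist items are illustrative:

```markdown
# AUTH_CONTRACT.md

You may not end this session until all of the following hold:

- [ ] Every test in `tests/test_auth.py` passes, and no test was modified.
- [ ] The project builds with no errors and no new warnings.
- [ ] A screenshot of the login page matches the agreed design.
```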
Agents That Run Forever
One question I’m frequently asked is: how can people run agents for 24 hours while ensuring they don’t go off-track?
Here’s a simple method: create a stop-hook that prevents the agent from terminating the session unless *all* parts of `{TASK}_CONTRACT.md` are satisfied.
If you have 100 such rigorously specified contracts—each capturing exactly what you want built—the stop-hook will block termination until all 100 contracts are fulfilled, including all required tests and validations!
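Here is a minimal sketch of such a stop hook, assuming the Claude Code hooks interface (a Stop hook reads a JSON payload on stdin; exiting with code 2 blocks the stop and feeds stderr back to the agent). The `- check:` line convention inside the contract file is my own invention, not anything official:

```python
#!/usr/bin/env python3
"""Stop hook: refuse to let the session end until the contract is satisfied."""
import json
import subprocess
import sys
from pathlib import Path

def check_contract(payload: dict, contract_path: str = "TASK_CONTRACT.md"):
    """Return (exit_code, message); exit code 2 blocks the stop."""
    # If this hook already blocked once and the agent is stopping again,
    # let it through rather than looping forever.
    if payload.get("stop_hook_active"):
        return 0, ""
    path = Path(contract_path)
    if not path.exists():
        return 0, ""  # no contract, nothing to enforce
    failures = []
    for line in path.read_text().splitlines():
        # My convention: lines like "- check: pytest -q" name a shell
        # command that must succeed before the session may end.
        if line.startswith("- check: "):
            cmd = line[len("- check: "):]
            if subprocess.run(cmd, shell=True).returncode != 0:
                failures.append(cmd)
    if failures:
        return 2, "Contract not satisfied; failing checks: " + ", ".join(failures)
    return 0, ""

if __name__ == "__main__":
    code, message = check_contract(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)
```

Registered under the Stop hook in `.claude/settings.json`, a script like this runs every time the agent tries to end its turn.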
Professional advice: I’ve found that running 24-hour-long, continuously active sessions is *not* optimal for “getting things done.” Partly because this approach structurally forces context bloat—since unrelated contract contexts all accumulate within the same session!
So I don’t recommend it.
Here’s a better way to automate agents: launch a *new session* for each contract. Create a contract whenever you need to accomplish something.
Build an orchestration layer that creates a new contract—and a new session to handle it—whenever “something needs doing.”
This will completely transform your agent experience.
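A minimal orchestration sketch under the same assumptions as before: it uses the Claude Code CLI's `-p` flag for non-interactive sessions, and the `contracts/pending` to `contracts/done` queue layout is my own convention, not the author's:

```python
import subprocess
from pathlib import Path

def run_contract(contract: Path) -> bool:
    """Spawn one fresh agent session for a single contract, so each task
    gets a clean context instead of inheriting unrelated ones.

    Assumes the Claude Code CLI's -p/--print flag.
    """
    prompt = (
        f"Read {contract} and implement everything it specifies. "
        "Do not consider the task done until every test and validation "
        "in the contract passes."
    )
    return subprocess.run(["claude", "-p", prompt]).returncode == 0

def drain_queue(pending="contracts/pending", done="contracts/done",
                run=run_contract):
    """Work through the queue, one session per contract; move finished
    contracts out of the way and return their names."""
    finished = []
    Path(done).mkdir(parents=True, exist_ok=True)
    for contract in sorted(Path(pending).glob("*.md")):
        if run(contract):
            contract.rename(Path(done) / contract.name)
            finished.append(contract.name)
    return finished
```

Dropping a new contract file into the pending directory is then all it takes to queue up work.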
Iterate, Iterate, Iterate
You hire an executive assistant—do you expect them to know your schedule on Day One? Or how you take your coffee? Or that you eat dinner at 6 p.m., not 8 p.m.? Obviously not. You gradually build preferences over time.
Agents are the same. Start with the simplest configuration—forget complex structures or harnesses—and give the basic CLI a fair chance.
Then, incrementally incorporate your preferences. How?
Rules
If you don’t want the agent to do something, codify it as a rule—and tell the agent about it in CLAUDE.md. Example: “Before writing code, read `coding-rules.md`.” Rules can be nested; rules can be conditional! If you’re writing code, read `coding-rules.md`; if you’re writing tests, read `coding-test-rules.md`; if your tests are failing, read `coding-test-failing-rules.md`. You can create arbitrarily complex logical branches for the agent to follow—and Claude (and Codex) will happily comply, provided CLAUDE.md contains clear instructions.
In fact, this is my first practical recommendation: treat your CLAUDE.md as a logical, nested directory—mapping *where to find context* under specific scenarios and desired outcomes. It should be as lean as possible—containing only IF-ELSE logic specifying “under what conditions, where to look for context.”
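Concretely, a lean CLAUDE.md-as-directory can be nothing but conditional pointers; the file names here are illustrative:

```markdown
# CLAUDE.md

- Before writing any code, read `rules/coding-rules.md`.
- Before writing tests, read `rules/coding-test-rules.md`.
- If tests are failing, read `rules/coding-test-failing-rules.md`.
- Before touching the database, read `rules/db-rules.md`.
- When a task has a known procedure, look for its skill under `skills/`.
```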
If you observe the agent doing something you disapprove of, add it as a rule—telling the agent to read that rule *before* performing that action next time. It won’t repeat the mistake.
Skills
Skills resemble rules—but instead of encoding coding *preferences*, they’re better suited for coding *procedures*. If you have a specific way you want something done, embed it as a skill.
In fact, people often complain about not knowing *how* the agent will solve a problem—which feels unsettling. To make it deterministic, first have the agent research how it would solve the problem—then formalize that solution as a skill file. You’ll see *in advance* how the agent handles the problem—and can correct or improve it *before* it encounters the problem in reality.
How do you inform the agent about this skill’s existence? Exactly! You write in CLAUDE.md: “When you encounter this scenario and need to handle this task, read this `SKILL.md`.”
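A skill file is then just the agreed procedure written down; the contents here are illustrative:

```markdown
# SKILL: database migration

1. Generate the migration file, but never trust autogeneration blindly.
2. Review the generated file by hand and add a downgrade path.
3. Apply it to the local database only; production migrations go through CI.
4. Run the full test suite before committing.
```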
Managing Rules and Skills
You’ll naturally want to keep adding rules and skills to your agent. This is how you imbue it with personality—and encode your preferences. Almost everything else is superfluous.
Once you start doing this, your agent will feel like magic. It will “act the way you want.” Then you’ll finally feel like you’ve “gotten it”—you’ve grasped agent engineering.
Then…
You’ll notice performance starting to decline again.
What’s happening?!
Simple. As you add more and more rules and skills, they begin contradicting each other—or the agent suffers severe context bloat. If you require the agent to read 14 Markdown files before starting to code, it faces the same problem of drowning in useless information.
What do you do?
Clean house. Give your rules and skills a “spa treatment”: have the agent consolidate them and resolve contradictions, with you re-articulating your updated preferences wherever they conflict.
Then it’ll feel like magic again.
That’s it. This really is the secret. Keep it simple. Use rules and skills. Treat CLAUDE.md as a directory—and devoutly respect context constraints and design limitations.
Own the Results
There is no perfect agent today. You can delegate much of the design and implementation work to agents—but *you* own the results.
So proceed with care… and then enjoy yourself!
Playing with tomorrow’s toys—while clearly using them for serious work—is genuinely fun!
Join the official TechFlow community to stay up to date:
Telegram: https://t.me/TechFlowDaily
X (Twitter): https://x.com/TechFlowPost
X (Twitter) EN: https://x.com/BlockFlow_News