
AI Investors’ 2026 Anxiety: When Models Devour Everything, What Moats Remain for Startups?
2026 Investor Edition of AI Panic: Just throw all your money at Anthropic and NVIDIA, then go home and sleep.
Author: Sarah Guo
Translated and edited by TechFlow
TechFlow Intro: As large models begin outperforming humans across all benchmarks, investors are falling into a kind of despair: “Besides Anthropic and NVIDIA, what’s even worth investing in?” This top-tier Silicon Valley investor uses data and real-world cases to show that true moats aren’t found on leaderboards—they reside precisely where benchmarks can’t reach.
By mid-2026, investors’ version of AI-induced psychosis is despair: “There’s nothing left to invest in—we should just throw all our money at Anthropic and NVIDIA and go home.”
I’ve never felt this way. I’m already convinced models are several versions smarter than me; I’m happy to buy Anthropic and NVIDIA at market prices; all my smartest friends are fairly confident self-improvement will succeed soon—but I still don’t feel this despair.
This despair isn’t irrational. The logic goes like this: if models keep improving across everything, then every company built atop them is merely a thin wrapper waiting to be absorbed—and the only value that survives is compute and frontier weights.
Take software, the case despair theorists lean on most. When Devin launched in 2024, it solved only 13% of tasks on standard software benchmarks and was largely ignored. A year and a half later, the best agents score above 80%, and they’re doing real work inside Goldman Sachs and the U.S. Army. Almost everyone draws the same mistaken lesson: models have eaten software engineering. But as models consume the most easily measured parts of software engineering, we’re rediscovering something many teams already knew: engineering has always resisted measurement, and the most measurable parts are not the only important ones.
Mert Demirer of MIT and his collaborators finally put numbers to it: among more than 100,000 developers, the latest coding agents increased code written by ~180%, while production-deployed code rose by only ~30%. Writing code got cheaper. The rest still requires people—and remains critically important. Of course, the net impact is still staggering.
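A quick back-of-the-envelope check makes the gap concrete. The sketch below assumes both percentages are increases over the same pre-agent baseline (the article does not say so explicitly), and the names `multiplier` and `ship_ratio` are illustrative, not taken from the study.

```python
def multiplier(pct_increase: float) -> float:
    """Convert a percentage increase into a multiple of the baseline."""
    return 1.0 + pct_increase / 100.0


# Figures quoted in the article, treated as approximate.
written = multiplier(180)   # code written: about 2.8x baseline
deployed = multiplier(30)   # code deployed to production: about 1.3x baseline

# Share of written code that ships, relative to the pre-agent share.
# ASSUMPTION: both increases are measured against the same baseline.
ship_ratio = deployed / written

print(f"written x{written:.1f}, deployed x{deployed:.1f}, "
      f"relative ship rate {ship_ratio:.0%}")
```

Under that assumption, each unit of written code is less than half as likely to reach production as before, which is one way to read the claim that writing got cheaper while the rest of the work did not.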
Benchmarks measure what you can measure—and what you can measure is what you can train against. So coding agents matured first: compilers are free validators; test suites are free validators; when answers check themselves for free, you can iterate endlessly until you beat the benchmark. But passing tests never tells you whether a change is *correct* for a decade-old codebase with three undocumented modules whose reasons for existence remain unknown—and whose deployment pipeline barely limps along thanks to a cron job no one admits to writing.
That kind of correctness doesn’t appear on leaderboards; in fact, it doesn’t appear anywhere. You learn it only by running a complex system in the real world long enough to see whether it works, and smarter models won’t make the world run faster. No one at Google scale trusts a system because its unit tests came back green; you trust it because it has endured years of real load. Such correctness isn’t just proprietary. It is a moat that builds slowly and that capital can’t shortcut. Even optimists admit the clock can’t be skipped: Noam Brown, a pioneer of OpenAI’s reasoning models, recently wrote that the only reliable way to evaluate an agent over a one-year horizon may well be running it for a year.
As Gabe Pereyra puts it, real automation isn’t just about models getting better. It’s about products, models, workflows, and companies moving together—and three of those four move at organizational speed.
The moving parts are exactly what benchmarks miss: convincing a skeptical partner to change how she handles her work; keeping a team united through a reorganization. That’s why, when we hire CEOs, skill with people matters at least as much as analytical skill, and smarter models won’t shift that weight. Feedback is fuzzy; time horizons span years; trust resides in individuals. Every company I know has rolled out cutting-edge coding models to all of its engineers, but none has restructured its engineering organization at anything close to that pace. Adoption took a quarter (that magical quarter of token growth!), but the rebuilding is taking years.
What’s visible is what’s leaving. Valuable work is structurally invisible: anything you can put on a leaderboard, you can train against—so anything measurable is already marching toward commoditization. That process takes time and is never complete—but its direction is irreversible. In the monetary terms used by my Rippling colleague Matt MacInnis: tokens spent answering generic questions are nearly worthless, because any model can answer them; tokens spent reasoning over *your* company’s data carry far higher value—they do what you actually want, not just what looks plausible.
Visible work gets eaten from two directions. From below: task saturation. Once a job becomes cheap to verify, buyers stop asking *which* model did it—and start asking *how much it cost*, pushing work onto whichever open-source or distilled model is cheapest that week. Profit margins ultimately matter wherever these agents make an impact. From above: labs are trying to get models to eat their own scaffolding—retrieval, routing between cheap and expensive calls, tool use, even reasoning strategies. All the apparatus once wrapped around models gets pulled into the weights—until the wrapper *is* the model. That’s absorption at the frontier. Margin pressure cuts both ways: general-purpose agents must be ready for anything—which is expensive—while focused applications can tune a workflow until it runs on a fraction of the token budget. And unlike labs selling those tokens, they keep the spread.
So for any kind of work, we can ask two questions. Is its correctness proprietary and costly to establish, a truth that exists only within someone’s data? Is it isolated, locked inside systems you can’t access? Cross those answers with the degree of task saturation and you get a 2x2 matrix. Saturated work with public answers is commodity tokens, owned by open-source models. Frontier work with public answers, the domain of coding benchmarks, is where labs win, because when evaluation is free, owning the evaluation isn’t valuable. The prize lies in the final quadrant: frontier work whose correctness is untrainable because it lives solely in private domains. You see it in the inference clouds that host AI-native pioneers, where the vast majority of tokens come from custom models, not generic open-source ones.
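The matrix can be written down as a small lookup, which also shows what the text leaves open. This is only a sketch: folding the two questions (proprietary correctness, isolation) into a single `private_correctness` flag is an assumption, the names `Quadrant` and `classify` are hypothetical, and the text describes only three of the four cells, so the fourth is labeled as such.

```python
from enum import Enum


class Quadrant(Enum):
    COMMODITY_TOKENS = "saturated, public answers: commodity tokens"
    LAB_TERRITORY = "frontier, public answers: labs win"
    UNTRAINABLE_PRIZE = "frontier, private correctness: the prize"
    # The article does not describe this cell; it is not filled in here.
    NOT_CHARACTERIZED = "saturated, private correctness: not described"


def classify(saturated: bool, private_correctness: bool) -> Quadrant:
    """Place a kind of work in the 2x2 matrix.

    ASSUMPTION: 'private_correctness' collapses the two questions
    (proprietary correctness, isolation) into a single yes/no flag.
    """
    if private_correctness:
        if saturated:
            return Quadrant.NOT_CHARACTERIZED
        return Quadrant.UNTRAINABLE_PRIZE
    if saturated:
        return Quadrant.COMMODITY_TOKENS
    return Quadrant.LAB_TERRITORY


# Example: benchmark-style coding work versus a bank's production system.
print(classify(saturated=False, private_correctness=False).value)
print(classify(saturated=False, private_correctness=True).value)
```

In practice both axes are matters of degree, so a real assessment would look more like a score on each axis than a pair of booleans.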
The wall into that final quadrant varies in height. A solo developer’s toy codebase is portable and standardized—so the climb is short. A bank’s production system is neither—and you won’t earn root access just by scoring 2% higher on SWE-Bench Verified.
Capability eats many things, but better models won’t turn private ground truths into public facts. Models hold no licenses, assume no liability, and own no corporate documents, and when an answer goes wrong, they can’t be sued. Intelligence isn’t the bottleneck here. Licensing is. Liability is. You can imagine a model vastly smarter than anyone, and it still needs permission to enter; someone still has to sign off on its output.
That door has both a lock and a latch. The lock is the environment: you earn trust only from *inside* the system, after security reviews, integrations, and contracts in which you sign off on results, and only then can you even verify whether the AI delivered useful work. The latch is users. Right now, most U.S. doctors open OpenEvidence daily, and no amount of compute can buy that. Labs could train a perfect medical model tomorrow and still couldn’t enter doctors’ habits, or UCSF’s decision-making processes, because trust builds slowly, through relationships, and requires the user’s consent. Gradient descent cannot produce it.
That’s work too. An application earns its place in the untrainable quadrant by doing unglamorous work: arranging a company’s private reality so models can act on it; equipping models with the tools to act; partnering with customers to reshape their employees’ reality. A company that delivers this translation is hard to replicate, and the translation never ends. Integration and maintenance last as long as the relationship does, and those relationships are won by teams that place domain-specialist engineers and tools beside the customer.
For example, at a top-tier white-shoe law firm, M&A alone runs to nearly 1,000 transactions a year. Because of confidentiality and other constraints, you can’t let hundreds of associates each download client files to their desktops and ask a generic agent to sift through them. Even if you could, you’d learn only fragments: corrections from one associate at a time, missing how the whole transaction flows. The critical signals live at the *transaction level*, and transactions have shapes: for M&A, it’s NDAs, term sheets, due diligence, purchase agreements, ancillary docs, and closing checklists; for IP litigation, it’s motions, discovery, prior art, and more motions. Each practice area has its own shape, and neither lawyers nor tools transfer across practice areas. What law firms *actually solve* sits one layer above all that: running every practice area in parallel, the way top partners simultaneously manage hundreds of matters, launch new ones, and train associates. Transforming such a firm isn’t a single task you can benchmark. It demands an operator who works with data under extremely fuzzy goals, incomplete feedback, and long time horizons, in an environment that never stands still.
Unfortunately, invisible value is also hard to sell, for the same reason it resists commoditization: a company can’t judge from the outside whether AI will transform its operations, just as a benchmark can’t. So the strongest companies stop trying to prove it from the outside and instead move inside, pricing on outcomes. Sierra charges when its agent solves a customer problem but not when it hands off to a human, so the price *becomes* the evaluation, which only works if Sierra owns the definition of “solved.” Cognition’s Devin does the same in software, offering “performance guarantees,” a promise that is meaningful only inside systems where you’re trusted to deliver results.
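To see why the price becomes the evaluation, here is a minimal sketch of outcome-based billing. It is not Sierra’s actual pricing logic: the `Conversation` record, the `is_billable` rule, and the price of 2.0 per resolution are all hypothetical, chosen only to show that the invoice depends entirely on who sets the `resolved_by_agent` flag.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Conversation:
    # Hypothetical record; whoever sets this flag owns the meaning of "solved".
    resolved_by_agent: bool
    handed_to_human: bool


def is_billable(c: Conversation) -> bool:
    """Bill only when the agent resolved the issue without a human handoff."""
    return c.resolved_by_agent and not c.handed_to_human


def invoice(conversations: Iterable[Conversation],
            price_per_resolution: float) -> float:
    """Total charge: number of billable resolutions times a flat price."""
    billable = sum(1 for c in conversations if is_billable(c))
    return billable * price_per_resolution


# Three conversations: two resolved by the agent, one handed to a human.
log = [
    Conversation(resolved_by_agent=True, handed_to_human=False),
    Conversation(resolved_by_agent=False, handed_to_human=True),
    Conversation(resolved_by_agent=True, handed_to_human=False),
]
total = invoice(log, price_per_resolution=2.0)  # hypothetical price
print(total)
```

The arithmetic is trivial on purpose. Everything contestable sits in the definition of `resolved_by_agent`, which is why this model works only for a vendor the customer trusts to make that call.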
Even token serving, the layer everyone calls a pure commodity, doesn’t behave like one. The best AI-native companies concentrate their serving on one or two providers (Baseten or Fireworks), because per-token cost commoditizes on schedule while reliability under real traffic and guaranteed access to scarce compute do not. Where you serve is a different choice from which models you use. Price is the *only* part of inference that behaves like a commodity.
A common objection: labs are your suppliers, so why won’t they run their first-party products below cost to squeeze you out, or revoke your API access and capture the market themselves? That’s the real despair argument, and it holds only if the model layer is a single-player game. It clearly isn’t. It looks more like a death race among three and a half players, with international competitors lagging six months on training and alliance sizes five times larger than last year. Customers want supplier competition; labs want market share, not to kill any single application.
You see this in the markets where labs compete directly. In consumer chat, the best model never simply wins. ChatGPT held the lead for years through real competition, and the share it is losing now flows to Gemini on the strength of Android and Search, not better models. Anthropic, currently rated by prediction markets (and internet sentiment) as having the best model, barely registers in consumer chat, yet it built its business in enterprise and coding. If better models can’t steal competitors’ users in the most central application, they won’t work their way into hospitals’ records or banks’ liability chains through integration either. Today’s public choices hinge on more than coding. If the frontier stays crowded, the layers above it will be valuable.
If work can’t be scored externally, *someone internally* must decide what even counts as a good answer—and that decision *is* the entire game. Enough such decisions, written down, become a benchmark. Harvey released one for law; Sierra released one for voice agents. You win the right to define “good” for a domain by becoming the one already used *in* that domain—these companies earned that authority through battles of real adoption.
The evaluations that move real money are private and company-specific: *this* company, for *this* type of transaction, will accept *this* as good work. That work is still very much unfinished, because law’s depth dwarfs any public test. OpenEvidence is defining what a safe clinical answer looks like. None of these are true measurements. They’re judgments about what’s *true* and what’s *good*, written down until they become the standard others are measured against, and foundation-model labs, no matter how brilliant, can’t write them, because that authority lives only *within* the domain. It tends to stay where it already sits. Senior lawyers write legal benchmarks. Defining safe clinical answers falls to physicians. And “solved” means whatever the company that already has the customers says it means.
The frontier of absorption keeps rising: as we learn to measure more work, the measurable gets eaten. The untrainable ground shrinks beneath anyone standing on it, so you can’t find a defensible spot and rest. You keep advancing toward whatever is still unscorable, and you constantly re-underwrite. On narrow tasks, using your private data and your own evaluation criteria, you can train to the frontier and beat general models where it matters. That specialized model becomes part of your moat. By contrast, competing on general models is a capital war, and you lose to whoever has the most compute. That’s the trap for companies with shallow access and visible tasks. It promises survival by out-training the frontier on generic tasks, but the winners seem to be dictated by datacenter scale, and the usual outcome is not an independent champion but an acquisition by a compute-rich player.
All this is defense. Offense is harder: choosing what to build first. That’s what I spent a year searching for—and may have found three times. Models don’t help here. They’ll do anything you point them at—but can’t tell you *what’s worth pointing at*. You can’t benchmark that—so you can’t train it. That’s why incumbents won’t take everything: they hold onto what they own, and the next thing comes from those who discover use cases before the rest of us. Perhaps intent is rarer than compute.
Despair theory gets half the story right. Thin wrapper layers *are* being absorbed—and much of what looks like a company today *is* a thin wrapper. But it’s wrong about what remains. The mechanism is clear; the destination isn’t. My bet is on the direction: intelligence keeps getting cheaper, and value keeps sliding toward the few places models can’t reach. The untrainable carries historical value. So pick one domain, do the unglamorous translation work, start writing down what “good” means there—because someone will.
The most-cited benchmark score this year is a map of territory about to become worthless—and a notice of who’s about to lose the right to define what “good” means.