
a16z: OpenAI and Others Won’t Kill All Application-Layer Opportunities—Put Down Your AI Anxiety
Will OpenAI Kill All AI Applications? a16z: You’re Going Down the Wrong Path.
Author: Joe Schmidt IV
Translated and edited by TechFlow
TechFlow Intro: What’s the biggest anxiety for AI founders? That OpenAI and Anthropic will kill all application-layer opportunities. An a16z partner answers with the “Yellow Brick Road” theory: large-model labs will dominate only horizontal, single-step tasks—the real opportunity lies in vertical domains, multi-step workflows, and highly regulated fields. This article is essential reading for AI founders and investors alike.
I’ve recently been asked the same question repeatedly—by founders and prospective employees alike: Is there anything left to do in the AI application layer, or will OpenAI and Anthropic kill everything?
Beneath this question lies a distinct form of AI anxiety. Some conclude that the only way to avoid permanent irrelevance is either to stay inside a major lab—or go into frontier areas like robotics or hard tech: anything, in theory, “beyond the lab’s reach.” If every piece of software will be consumed—either directly absorbed by Codex or Claude, or rendered unnecessary by future models—then run!
Look, like nearly everyone else, I’m an AI maximalist—and I think they’re half-right. Labs *are* consuming vast swaths of the application surface. But the “application layer” isn’t a monolithic opportunity. The right framing is whether you’re on the Yellow Brick Road—or somewhere else in Oz.
The Yellow Brick Road is our shorthand for the path labs are walking—and pouring massive resources into. Labs excel at problems like code generation, writing, or image creation because those tasks improve directly as raw model capability improves: every dollar spent on pretraining and post-training lifts product quality. Meanwhile, elsewhere in Oz live more complex, often verticalized problems—not as simple as handing business users a horizontal tool plus standard software and computer literacy. Value accrues instead from the scaffolding built around the model: infrastructure that makes outputs trustworthy, compliant, and actionable within a specific industry—not just raw model capability (though that remains critical!).
We’re seeing this play out in real time, as OpenAI and Anthropic effectively tell the market they *can’t* solve all problems with general-purpose AI colleagues. They’ve announced massive front-loaded joint ventures focused on configuring and customizing their models for enterprises—building entire companies around enterprise deployment. You wouldn’t invest billions into these projects if you believed the next model release would solve them.
So if you want to get rich building AI applications—step off the Yellow Brick Road and build elsewhere in Oz. Here’s what we’ve learned—and what some of our portfolio founders have learned—about what works.
The Yellow Brick Road
If you’re launching a company, the Yellow Brick Road is the most obvious path—but also the most dangerous. Take a high-performing model, plug in some off-the-shelf connectors (e.g., G Drive, Slack, Salesforce, Notion, GitHub), and layer on some kind of agent orchestration. It’s magical!
The problem is that’s exactly what labs are doing with Cowork and Codex. Obviously, they own the models—which gives them better margins, control, and pricing power over any downstream player. But perhaps most importantly, they also own the architectural choices that define *what their products are good at solving*. So far, they’ve been cautious about the model-plus-tool-calling pattern—and that’s precisely what low-step-count, horizontal work on the road requires. Even if a startup could somehow outperform Codex or Claude Code, labs possess enormous distribution channels and the largest brand halo in AI.
If you’re an AI application company running this playbook with identical connectors—no sub-agents, no configuration, no distribution—you’re likely headed nowhere.
Elsewhere in Oz
It’s not all doom and gloom for startups. Massive opportunities exist beyond the Yellow Brick Road—places where startups have a clear path to owning their customers and solving complex problems.
These companies are building agent experiences where models are woven into complex networks of tools, automation, and integrations—in other words: software—making these startups inherently vertical by default. They can focus on multi-step, multi-stakeholder workflows, using role- and domain-specific sub-agents—something Anthropic and OpenAI, constrained by their horizontal platforms, simply cannot touch: gathering context across systems, then routing it to multiple people who must approve at different stages. This often involves one or more legacy systems, demands deterministic outcomes, tolerates no ambiguity, and is frequently tied to concrete business outcomes. Labs understand how valuable these problems are—that’s why they’re building their own outsourced configuration stores, and why an entire high-end reinforcement learning services category exists.
Why Elsewhere in Oz Won’t Be Owned by the Wizard
A common rebuttal to the above is that betting against model/lab improvement has historically been a terrible trade. They’ll almost certainly keep getting better—and eventually erode these application-layer service markets.
Labs *will* improve—but I believe “Elsewhere in Oz” offers several durable ways to protect itself over time:
Data and Learning Flywheels:
Much of what you internalize lives outside any training set—unwritten industry norms, undocumented standards, tribal knowledge residing only in practitioners’ heads. None of this lives on the public web. No amount of training compute can substitute for being embedded in the actual workflows where this knowledge resides. There are two overlapping flywheels here: one cross-customer—patterns that accumulate as you see more variants of the same problem—and one intra-customer—the rationale behind specific decisions, unspoken exceptions, and company-specific heuristics, which only surface through real interaction with the system.
Even if customer data can’t be shared cross-customer, application companies can leverage cross-customer patterns in problem types—and use them to architect correct solutions for future problems. A company that’s already run its agents through one hundred legal redlines, one thousand insurance underwriting cycles, or ten thousand SDR campaigns has internalized the shape of the problem in a way no new entrant—even one launching a brand-new agent—can replicate.
Horizontal agents could theoretically build the same learning infrastructure. They don’t—partly due to pure focus, but also user experience: capturing this knowledge depends entirely on the workflow interface you give users, and vertical players can shape those interfaces around what needs to surface for *their* workflows. Horizontal tools can’t. Evaluation sets, labeled outputs, and edge-case taxonomies can accumulate into vertical-specific data flywheels—fueling fine-tuning—while newcomers can’t generate comparable production exposure. Whether this is possible depends on data rights, accumulated production exposure, and contract structure—but pattern recognition accumulates regardless.
Managing Model Variability and Complexity: Labs already route internally—using different model classes for different requests, ensembling under the hood. What they *can’t* do is route across vendors, evaluate competitors’ models for specific subtasks, or deploy open-source fine-tuned models for narrow, optimal segments. Companies “Elsewhere in Oz” pick the right model for each subtask across the *entire* model market—not just what their parent lab releases. They also do the thankless work—re-running evaluations on upgrades, recalibrating prompts for customers’ edge cases, rolling out changes without breaking production—every time a new model launches. Labs won’t do this for customers; they’ll sell you the next model and tell you to migrate. Companies “Elsewhere in Oz” absorb the migration work. Customers get best-in-class intelligence *across* the market—and continuity with every upgrade.
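The migration work described above (re-running evaluations whenever a new model ships, across vendors) can be sketched as a simple regression gate. This is a minimal illustration, not any company's actual pipeline: the model names, the stub model functions, and the three-item evaluation set are all hypothetical stand-ins for real API calls and real customer edge cases.

```python
from typing import Callable

Model = Callable[[str], str]
EvalSet = list[tuple[str, str]]  # (prompt, expected output) pairs


def pass_rate(model: Model, eval_set: EvalSet) -> float:
    """Fraction of evaluation cases the model answers exactly as expected."""
    hits = sum(1 for prompt, expected in eval_set if model(prompt) == expected)
    return hits / len(eval_set)


def choose_model(current: str, candidates: dict[str, Model], eval_set: EvalSet) -> str:
    """Keep the current model unless a candidate strictly beats it on the eval set."""
    best_name = current
    best_score = pass_rate(candidates[current], eval_set)
    for name, model in candidates.items():
        score = pass_rate(model, eval_set)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical stand-ins for models from two vendors (no real API is called).
def vendor_a_v1(prompt: str) -> str:
    return prompt.upper()


def vendor_a_v2(prompt: str) -> str:
    return prompt.upper().strip()


def vendor_b(prompt: str) -> str:
    return prompt.title()


CANDIDATES = {"vendor_a_v1": vendor_a_v1, "vendor_a_v2": vendor_a_v2, "vendor_b": vendor_b}

# Hypothetical edge cases for one subtask: messy whitespace in field values.
EVAL_SET = [("renewal ", "RENEWAL"), ("claim", "CLAIM"), (" quote", "QUOTE")]

chosen = choose_model("vendor_a_v1", CANDIDATES, EVAL_SET)
```

The design choice worth noting is the strict comparison: an upgrade is adopted only when it measurably improves on the customer's own cases, so a new release never silently replaces a model that already works.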
Cost Optimization: Running every query on Opus 4.7 is the fastest path to negative gross margin. The best “Elsewhere in Oz” companies route across model tiers: frontier models for the hardest tasks, mid-tier models for bulk work, and smaller custom or fine-tuned models where they’ve earned usage rights. Some now even post-train their own models, optimized narrowly for the workflow segments customers care about, at a fraction of the cost of frontier API calls. Labs price to the bottom line: maximum intelligence for $X. “Elsewhere in Oz” companies sell the opposite: minimum dollar cost for the *specific level of intelligence* the workflow actually requires. That’s only possible when you know *exactly* what each subtask demands, and labs structurally cannot know this across every vertical. This translates directly into lower, more controllable outcome pricing.
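A rough sketch of tier routing follows, assuming each subtask can be tagged with the capability level it needs. The tier names, capability scores, and per-call prices are invented for illustration and do not reflect any vendor's actual pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical label, not a real product name
    capability: int       # illustrative score: higher means more capable
    cost_per_call: float  # illustrative dollars per call


# Hypothetical tiers and prices, invented for this example only.
TIERS = [
    ModelTier("small-finetuned", capability=1, cost_per_call=0.001),
    ModelTier("mid-tier", capability=2, cost_per_call=0.01),
    ModelTier("frontier", capability=3, cost_per_call=0.10),
]


def route(required_capability: int, tiers=TIERS) -> ModelTier:
    """Return the cheapest tier whose capability meets the requirement."""
    eligible = [t for t in tiers if t.capability >= required_capability]
    if not eligible:
        raise ValueError("no tier is capable enough for this subtask")
    return min(eligible, key=lambda t: t.cost_per_call)


def workflow_cost(subtasks: dict[str, int]) -> float:
    """Total cost when every subtask runs on its cheapest adequate tier."""
    return sum(route(level).cost_per_call for level in subtasks.values())


# Hypothetical subtasks mapped to the capability level each one needs.
SUBTASKS = {"classify_email": 1, "draft_reply": 2, "negotiate_terms": 3}

routed_cost = workflow_cost(SUBTASKS)
all_frontier_cost = len(SUBTASKS) * TIERS[-1].cost_per_call
```

The routing itself is trivial. The hard part, as the paragraph above argues, is knowing the required capability for each subtask, which here is simply assumed as an input.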
Governance: Becoming the control plane for how customers run AI in a vertical carries substantial value. It is where permissions, audits, what agents are *allowed* to do, and what they *actually did* all converge. That control plane is built from use-case-specific guardrails, which differ radically across industries and job types. Because these companies own the tools, workflows, and data their agents touch end to end, they can deliver deterministic outcomes in ways horizontal tools struggle to match. They are also the entity that absorbs regulatory complexity for the end buyer: FRCP and bar association rules in law, HIPAA in healthcare, SEC and FINRA in finance, state insurance regulations, and so on. Horizontal players can’t credibly do this unless they simultaneously become 100 different verticals. CIOs want a partner who contractually commits to handling compliance for the agents they provide.
All of this circles back to one thing: focus. It can be a vertical (insurance, law, accounting) or a deeply executed function (sales, customer support, finance). Either way, this work demands a team singularly dedicated to one customer segment—their workflows, edge cases, regulations. Labs aren’t built for this. They must be everywhere, serving everyone—that’s how they built the Yellow Brick Road in the first place. That same trade-off prevents them from entering “Elsewhere in Oz”—you can be everywhere, or excellent at one thing. Not both.
Sales as an Example—Practical Advice from 11x Tech CEO
How should you think about this in practice? Here’s practical advice from 11x CEO Prabhav Jain.
Focus on Outcomes
The tactical path to building a lab-resilient company starts with the *specific outcome your customers truly care about*. For us, that’s helping companies generate more sales leads. From there, the problem becomes tactical. Which activities *actually drive* lead generation—and which ones do we want to own end-to-end? Break each activity into tasks. Which are agentified, and which aren’t? Which require deep domain insight—and which don’t? Labs will ship workflows too—but when workflows involve many steps, messy inputs, hard-to-interpret states, or real-world constraints, better models alone won’t get you there. The work falls to old-fashioned software engineering—and labs hold no advantage over focused application companies on this surface. For example, here are some tasks we handle—some agentified, some not: prospect mining based on custom signals, prospect enrichment, deep account research, pulling context from CRM, channel-specific message writers, lead qualification agents, and email delivery systems. These aren’t one-off tasks—they demand deep engineering.
The key insight from the Oz analogy is that labs hold *no advantage* on the non-agentified part of any real workflow, which is roughly half of it. They’re no better than you at writing the deterministic software beneath the model layer. And the agentified half still requires you to tune, train, and constrain models for the *actual outcomes you want*. Domain knowledge rarely lives in general training data. These skills are built from scratch for a vertical or function, then injected into the model at the right moment in the workflow. When our agent qualifies inbound leads on calls, I must train it specifically on good sales conversations for that industry and role. That’s application-company work, and it compounds.
More importantly, these skills constantly *expire*, as businesses evolve. So your ability to keep these workflows and contexts evolving *is* the real competitive advantage. For example, when we launched our scaled email outreach product, “AI-written” emails were just emerging. Fast-forward to today: people now have sharp intuition distinguishing AI- vs. human-written emails—and that intuition shifts every few months. Our agents must continuously adapt to market dynamics—and *that’s* where the moat forms. In fact, despite constant market shifts, our positive reply rate has quadrupled over the past few months—generating hundreds of millions in sales opportunities for customers.
Focus on High-Complexity Problems
High-complexity problems are where real commercial value unlocks. Otherwise, you’re just building a thin wrapper.
Break down any sufficiently complex business problem—and chaos emerges quickly. Here’s a GTM example that sounds simple: “Don’t contact contacts at companies already your customers.” Reality is far messier. Maybe your CRM holds the company’s domain. What about companies with dozens of subsidiaries? What if the CRM record shows the parent company’s domain? What if an outdated matching field in Salesforce sends cold emails to an existing customer’s CRO? Real-world data is chaotic. Humans struggle with it. Models don’t magically leap this gap. Imposing order from chaos requires agents purpose-built for that specific problem shape—not a generic copilot pointed at CRM. In fact, our data shows our data quality and freshness far exceed customers’—so we default to our data.
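To make the subsidiary problem concrete, here is a minimal sketch of a suppression check that resolves both the contact's domain and the CRM's customer domains to a corporate root before comparing them. The domains and the parent mapping are made up; in practice that mapping is exactly the messy, hard-to-maintain data the paragraph describes.

```python
# Hypothetical data: maps a subsidiary's email domain to its parent's domain.
PARENT_OF = {
    "acme-labs.io": "acme.com",
    "acme-europe.eu": "acme.com",
}

# Hypothetical domains stored on existing-customer records in the CRM.
CUSTOMER_DOMAINS = {"acme.com", "globex.com"}


def normalize(domain: str) -> str:
    """Lowercase, trim whitespace, and drop a leading 'www.'."""
    return domain.strip().lower().removeprefix("www.")


def root_domain(domain: str, parent_of=PARENT_OF) -> str:
    """Follow parent links until reaching a domain with no known parent."""
    seen = set()
    current = normalize(domain)
    while current in parent_of and current not in seen:
        seen.add(current)  # guards against cycles in dirty data
        current = normalize(parent_of[current])
    return current


def is_suppressed(email: str) -> bool:
    """True if the contact's corporate root matches any customer's root."""
    domain = email.rsplit("@", 1)[-1]
    customer_roots = {root_domain(d) for d in CUSTOMER_DOMAINS}
    return root_domain(domain) in customer_roots
```

Comparing roots on both sides means a CRM record holding only the parent domain still blocks outreach to a subsidiary's CRO. It also suppresses sibling subsidiaries of a customer, which is a conservative policy choice assumed for this sketch rather than something the text prescribes.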
Guardrails aren’t just about preventing bad things. They’re why customers pay you.
Guardrails are severely underestimated. Even within the same product, every use case demands its own guardrails. For us, a regulated financial-services prospect requires safeguards utterly unlike those for a mid-market SaaS customer. These safeguards permeate how agents draft content, whom they can contact, which data they can access, what they can say on calls—and how every decision is logged.
One-size-fits-all systems collapse under such variance. Guardrails must be built per use case, configured per customer, and audited continuously. This work falls entirely to application companies. That’s why we have forward-deployed engineers (FDEs) and technical deployment strategists who tune the product to each customer’s needs. For example, we partnered with a Fortune 1000 institution to conduct opt-in outbound voice calls to their vast SMB customer base. Initial iterations had low answer rates, so we had to iterate rapidly, learning how to engage this specific audience within the first 10 seconds of the call. SMB owners behave radically differently from large B2B buyers or consumers. We now generate more sales opportunities for them in a single day than their entire SMB sales team does in a month.
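The idea of guardrails that are configured per customer and logged on every decision could look something like the sketch below. The customer keys, policy fields, and banned phrase are hypothetical examples chosen for illustration; real policies would be far richer.

```python
from dataclasses import dataclass, field


@dataclass
class GuardrailPolicy:
    allowed_channels: set[str]
    banned_phrases: list[str]
    require_opt_in: bool


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)

    def record(self, customer: str, channel: str, allowed: bool, reason: str) -> None:
        self.entries.append(
            {"customer": customer, "channel": channel, "allowed": allowed, "reason": reason}
        )


# Hypothetical per-customer policies: the same product, very different limits.
POLICIES = {
    "regulated_finserv": GuardrailPolicy({"email"}, ["guaranteed returns"], True),
    "midmarket_saas": GuardrailPolicy({"email", "voice"}, [], False),
}


def check_action(customer: str, channel: str, message: str,
                 contact_opted_in: bool, log: AuditLog) -> bool:
    """Apply the customer's policy and log the decision whether or not it passes."""
    policy = POLICIES[customer]
    if channel not in policy.allowed_channels:
        reason = "channel not allowed"
    elif policy.require_opt_in and not contact_opted_in:
        reason = "contact has not opted in"
    elif any(phrase in message.lower() for phrase in policy.banned_phrases):
        reason = "banned phrase in message"
    else:
        reason = "ok"
    allowed = reason == "ok"
    log.record(customer, channel, allowed, reason)
    return allowed
```

Note that blocked actions are logged as well as permitted ones, so the audit trail covers both what the agent was allowed to do and what it attempted.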
Insurance as an Example—Practical Advice from FurtherAI CEO
Sales is one example. Insurance is another—illustrating the same point from a different angle. Here’s FurtherAI CEO Aman Gour on how to build “off-road”:
When we began deploying AI in real insurance operations, we kept hearing a specific assumption: “The model *is* the intelligence—and the workflow is just scaffolding around it.”
After partnering with more and more insurers, we’ve grown increasingly certain this view is wrong.
In insurance, much of the intelligence lives *in the workflow itself*. Two insurers might route an application through seemingly identical steps: submit, review, quote, underwrite. The path is the easy part. What distinguishes them is everything *inside* that path: which risks require escalation, which loss signals matter, which risk-preference rule takes priority when two conflict, when human sign-off is required, which external data sources to invoke—and how final decisions are recorded.
This logic doesn’t reside in a clean rules engine. It’s scattered across standard operating procedures, manager reviews, underwriting philosophies, company-specific risk appetites, and years of operational experience. Much of it isn’t documented in any format models can read directly.
That’s why we don’t trust purely agent-driven, first-principles reasoning every time—or rigid workflows that collapse upon encountering messy reality. What we’re building is *agentified workflows*. Workflows deliver repeatability, auditability, and cost control. Agents handle variability—and recover when the ideal path breaks. Humans remain in the loop for judgment calls requiring accountability.
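One way to picture an agentified workflow is as a deterministic rule that runs first, an agent that takes over only when the rule cannot decide, and a human escalation when the agent is not confident. The sketch below assumes a made-up insured-value threshold and uses a stub in place of a real model call; none of it reflects FurtherAI's actual system.

```python
from typing import Optional


def rule_step(submission: dict) -> Optional[str]:
    """Deterministic path: decide only when the input is clean."""
    value = submission.get("insured_value")
    if not isinstance(value, (int, float)):
        return None  # messy input: the rule cannot decide
    # Hypothetical threshold, chosen purely for illustration.
    return "quote" if value <= 1_000_000 else "refer"


def stub_agent(submission: dict) -> tuple[Optional[str], float]:
    """Stand-in for a model call; returns (decision, confidence)."""
    raw = str(submission.get("insured_value", "")).replace(",", "").replace("$", "")
    if raw.isdigit():
        return ("quote" if int(raw) <= 1_000_000 else "refer", 0.9)
    return (None, 0.0)


def run(submission: dict, agent=stub_agent, min_confidence: float = 0.8):
    """Try the rule, fall back to the agent, escalate to a human if unsure."""
    trail = []  # audit trail of which path handled the submission
    decision = rule_step(submission)
    trail.append(("rule", decision))
    if decision is None:
        decision, confidence = agent(submission)
        trail.append(("agent", decision))
        if decision is None or confidence < min_confidence:
            decision = "human_review"
            trail.append(("escalate", decision))
    return decision, trail
```

The trail returned alongside each decision is what makes the workflow auditable, and every escalation it records is the kind of signal the following paragraphs describe feeding back into the workflow over time.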
Day one, this automates manual work. Over time, every escalation becomes a signal, every exception feedback, every human correction revealing gaps in the operations manual. Gradually, the workflow ceases to be a script—and begins to function as the insurer’s operational memory. This is the part labs struggle to reach. They’ll keep releasing better models and better general-purpose agents—and that’s fine. But they won’t spend enough time inside insurers’ production workflows to understand *why* a particular account was escalated, *why* a risk was declined, or *why* an underwriter overrode the risk-preference guidelines—and was right.
This understanding only comes from running workflows thousands of times in production. The workflow you ship on Day One isn’t the moat. The feedback loops created by production usage *over time* are.
For us, that’s what building “off-road” means.
How to Tell If You’re “Elsewhere in Oz”
Tool-and-Step Test: How many steps does this work require—and how complex a set of tools must you build to support it? Compare horizontal AI search on Google Drive—one step, one tool, high tolerance for error (users skim the summary and re-query if wrong)—with multi-step legal redlining against three years of law-firm precedent: dozens of steps across multiple tools, output requiring partner review—and possibly defensible in court. Both look like “agents doing work,” but only the latter demands deeply engineered software built by a focused team over years.
System Test: Are you building a *system* your customers use to run work, or a *tool* layered atop their existing systems? Systems own the workflow end to end (data capture, governance, completion logging); they’re what customers point to when describing how real work actually happens. Tools merely add intelligence to workflows customers already run. Tools can generate real revenue, but labs can take that revenue away, because customers don’t depend on you as the orchestration layer. High ACV is often a signal of a *system*, since systems replace real humans and get paid accordingly, but it’s not guaranteed. Ask yourself: if a lab launched something that *directly competes* with you, would customers still need your product? If yes, you’re building a system. If no, you’re a tool, even with high ACV.
Hedge-Fund / P&L Test: Labs are judged on benchmark scores; “Elsewhere in Oz” is judged on customers’ P&Ls. Your customers don’t care how your model scores on SWE-Bench or MMLU—they care whether your agent closed the deal, correctly redlined the contract, or underwrote the right policy. If they care about outcomes for a *specific workflow*, not general capability scores—you’re “Elsewhere in Oz.” If they pay for general capability, they’ll get it via a Claude or Codex subscription. The best agent businesses must perform like hedge funds—winning with alpha in the customer’s P&L, not benchmark scores.
Both Can (and Will) Win
We’ll see massive winners both *on* the Yellow Brick Road and off it. Labs will keep winning, because they own the models and the distribution channels for their horizontal tools.
“Elsewhere in Oz” can win—if they own the *work system*: the interface where companies actually execute work—and the data flowing through and captured by it. These companies own data capture, workflow action systems, and governance. As more complex workflows mature in verticals, they compound into a core experience customers rely on. When next-gen models launch—from incumbents and newcomers alike—these companies become the layer that integrates and delivers them to customers. The underlying models are swappable; the work system is not.
The next generation of enterprise software will be built off-road.
If you’re building it, reach out: jschmidt@a16z.com.











