
Everything You Need to Know About GPT-5.5: Starting Today, OpenAI “Sells No Tokens”
TechFlow Selected
Bigger, more expensive, and smarter—some say losing it feels like an amputation.
Author: Helen
On April 23 local time, OpenAI officially launched its next-generation flagship model, GPT-5.5, positioning it as “a new tier of intelligence designed for real-world work”—a pivotal step toward a fundamentally new way of computing.
Two core aspects defined this launch:
First, a breakthrough in efficiency: GPT-5.5 is a larger model, yet it runs at the same per-token latency as its predecessor. It also offers a context window of up to 1 million tokens, a genuine leap over GPT-5.4 rather than an incremental upgrade.
Second, during training, GPT-5.5 actively participated in optimizing its own inference infrastructure. In short, AI has, for the first time, learned to tune its own parameters.
In Terminal-Bench 2.0—a benchmark testing complex command-line workflows—GPT-5.5 scored 82.7%, surpassing Claude Opus 4.7’s 69.4% by over 13 percentage points. In OSWorld-Verified—which evaluates AI’s ability to independently operate a real computer—the success rate reached 78.7%, exceeding the human baseline. In GDPval—a benchmark spanning 44 professional knowledge domains—GPT-5.5 achieved or exceeded expert-level performance on 84.9% of tasks.
However, GPT-5.5’s pricing has also risen significantly.
Its API is priced at $5 per million input tokens and $30 per million output tokens, double GPT-5.4's rates ($2.50 input / $15 output per million tokens). OpenAI emphasizes, however, that GPT-5.5 requires far fewer tokens to complete the same tasks, so overall cost may not rise substantially. The GPT-5.5 Pro API is priced at $30 per million input tokens and $180 per million output tokens. Batch and flex processing receive a 50% discount; priority processing costs 2.5× the standard rate.
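To see how the "fewer tokens may offset higher prices" claim plays out, here is a small cost calculator using only the rates quoted above. The token counts in the comparison at the bottom are illustrative assumptions, not published figures.

```python
# Cost estimator built from the listed API rates:
# GPT-5.4 at $2.50 / $15 and GPT-5.5 at $5 / $30 per million
# input / output tokens; 50% batch discount, 2.5x priority rate.

PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
}

def job_cost(model, input_tokens, output_tokens, batch=False, priority=False):
    """Estimate the USD cost of one job under the published rates."""
    in_rate, out_rate = PRICES[model]
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    if batch:      # batch/flex processing: 50% discount
        cost *= 0.5
    if priority:   # priority processing: 2.5x the standard rate
        cost *= 2.5
    return cost

# Illustrative assumption: GPT-5.5 completes the same task with
# roughly 60% fewer output tokens, making it cheaper overall
# despite the doubled per-token price.
old = job_cost("gpt-5.4", 50_000, 120_000)
new = job_cost("gpt-5.5", 50_000, 50_000)
print(f"GPT-5.4: ${old:.2f}  GPT-5.5: ${new:.2f}")
```

Whether the two bills actually come out close depends entirely on how large the token reduction is in practice; the calculator just makes that trade-off explicit.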
Within ChatGPT, GPT-5.5 is rolling out as “GPT-5.5 Thinking,” gradually replacing earlier versions.
A small but notable new feature: Before beginning to think, the model now provides a brief overview of its reasoning approach—users can interject at any point during execution to redirect or refine the process.
To summarize GPT-5.5’s significance in one sentence: Previous models were collections of capabilities; GPT-5.5 is closer to a working system—one that plans, verifies, and persistently drives tasks forward.
01 84.9% of Tasks Meet or Exceed Professional Standards
GPT-5.5 vs. competitors across key benchmarks: Terminal-Bench 2.0, GDPval, and OSWorld-Verified
Let’s first examine how models perform in realistic professional settings. OpenAI introduced a benchmark called “GDPval,” which requires models to execute full professional workflows. It spans 44 occupational scenarios—including financial modeling, legal analysis, data science reporting, and operations planning.
Results show that GPT-5.5 meets or exceeds industry professional standards on 84.9% of tasks. By comparison, GPT-5.4 achieves 83.0%, Claude Opus 4.7 reaches 80.3%, and Gemini 3.1 Pro scores only 67.3%.
This gap extends beyond aggregate scores. In spreadsheet modeling, internal tests place GPT-5.5 at 88.5%; it likewise leads prior generations on investment-banking–level modeling tasks. Early testers’ feedback is consistent: GPT-5.5 Pro delivers markedly improved comprehensiveness, structure, and practicality versus GPT-5.4 Pro—especially in business, law, education, and data science.
Numbers alone can numb the senses—so OpenAI pulled back the curtain on its own workplace.
OpenAI reports that over 85% of its employees use Codex weekly across departments including finance, communications, marketing, product, and data science. The communications team used it to analyze six months of speaking engagement requests, building an automated triage pipeline; the finance team reviewed 24,771 K-1 tax forms—totaling 71,637 pages—finishing two weeks ahead of last year; the market expansion team saved 5–10 hours per person per week via automated weekly reporting.
This isn’t a lab demo—it’s daily operational reality.
02 The Most Capable Autonomous Programming Model
OpenAI states that GPT-5.5 is currently its most capable autonomous programming model.
On Terminal-Bench 2.0—which tests complex command-line workflows requiring planning, iteration, and tool coordination—GPT-5.5 scores 82.7%, up nearly 8 percentage points from GPT-5.4’s 75.1%, while consuming fewer tokens. On SWE-Bench Pro—which assesses one-shot resolution of real GitHub issues—GPT-5.5 scores 58.6%. And on OpenAI’s internal Expert-SWE evaluation—measuring long-duration programming tasks with a median human completion time of ~20 hours—GPT-5.5 again outperforms GPT-5.4.
Terminal-Bench 2.0 and Expert-SWE scatter plots
Powered by GPT-5.5, Codex can now independently execute full development cycles—from code generation and functional testing to visual debugging—starting from a single-line prompt.
Official OpenAI demos illustrate this capability: A space mission application built on real NASA orbital data supports 3D interactive manipulation, with orbital mechanics simulated to true physical fidelity; an earthquake tracker ingests live data feeds and renders dynamic visualizations—demonstrating full capability to call external APIs, process streaming data, and render outputs in real time.
User feedback underscores these advances. Dan Shipper, Founder & CEO of Every, shared an experience: He’d spent days troubleshooting a post-launch bug before bringing in his company’s top engineer—who rewrote part of the system. With GPT-5.5, Shipper ran an experiment: feeding the model the exact state where the bug remained unfixed, to see if it could arrive at the same solution as the engineer. GPT-5.4 failed; GPT-5.5 succeeded. He remarked: “This is the first programming model I’ve used that truly exhibits conceptual clarity.”
An NVIDIA engineer put it more bluntly: “Losing access to GPT-5.5 feels like an amputation.”
Michael Truell, Co-Founder & CEO of Cursor, added: “GPT-5.5 is smarter and more resilient than GPT-5.4—capable of sustaining focus longer on complex, extended tasks—the very trait engineering work demands most.”
03 Knowledge Work: AI Can Now Truly “Use” a Computer
In OSWorld-Verified—which tests whether models can independently operate a real computer environment—GPT-5.5 achieves a 78.7% success rate, up from GPT-5.4’s 75.0% and ahead of Claude Opus 4.7’s 78.0%.
This isn't screenshot analysis but genuine screen interaction: seeing the interface, clicking, typing, and switching between tools until the task is complete. For the first time, users get the sense that an AI can genuinely share their computer.
Financial modeling demo video
On Tau2-bench—a telecom customer service workflow test—GPT-5.5 achieves 98.0% accuracy without prompt tuning, compared to GPT-5.4’s 92.8%.
This reflects deeper intent understanding—enabling handling of intricate multi-step dialogues without meticulously engineered prompts.
For tool search capability, GPT-5.5 scores 84.4% on BrowseComp, rising to 90.1% for GPT-5.5 Pro—indicating strong sustained retrieval and information integration capacity for research tasks requiring synthesis across multiple sources.
04 Scientific Research: Assisting Discovery of New Mathematical Proofs
Perhaps the most surprising aspect of this release is GPT-5.5’s performance in scientific research.
Historically, AI’s role in science has been largely “assistive”—literature search, code writing, data organization. This time, its role shifts meaningfully forward into core activities: complex reasoning—and even discovery itself.
On GeneBench—a multi-stage evaluation of genetic and quantitative biology data analysis—GPT-5.5 scores 25.0%, versus GPT-5.4’s 19.0%. These tasks typically represent days of expert work, demanding inference under near-zero supervision—identifying potentially erroneous data, navigating hidden confounders, and correctly applying modern statistical methods.
As shown in the chart, GPT-5.5’s score improves more steeply than GPT-5.4’s as output token count increases—and pulls away notably around 15,000 tokens. This means GPT-5.5’s advantage amplifies with task complexity, especially for deep-reasoning, long-horizon tasks.
On BixBench—a real-world bioinformatics and data analysis benchmark—GPT-5.5 scores 80.5%, ahead of GPT-5.4’s 74.0%, ranking among the highest-scoring publicly released models.
What truly drew attention was a concrete case: An internal version of GPT-5.5, equipped with a custom tool framework, assisted in discovering a new mathematical proof concerning Ramsey numbers—and formally verified it using the Lean theorem prover. Ramsey numbers lie at the heart of combinatorics; breakthroughs here are exceptionally rare and technically formidable. This wasn’t AI providing code or explanation—it contributed an original mathematical argument.
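The Ramsey-number argument itself has not been published, but for readers unfamiliar with Lean: formal verification means the proof is checked mechanically by the theorem prover, step by step, down to the axioms. A trivial, purely illustrative Lean 4 theorem (not the actual proof) looks like this:

```lean
-- Purely illustrative, not the Ramsey-number result:
-- a statement and a machine-checked proof in Lean 4.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the proof term did not actually establish the stated proposition, Lean would reject it at compile time; that is what "formally verified" buys over a human-reviewed argument.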
Real-world applications are equally compelling. Derya Unutmaz, Immunology Professor at Jackson Laboratory, used GPT-5.5 Pro to analyze a gene expression dataset comprising 62 samples and nearly 28,000 genes—generating a detailed research report and distilling key findings and open questions. He noted such work normally requires teams months to complete.
Bartosz Naskręcki, Assistant Professor of Mathematics at Adam Mickiewicz University in Poznań, built an algebraic geometry application in just 11 minutes using a single prompt with GPT-5.5 in Codex. It visualizes the intersection curve of two quadric surfaces and converts it into a Weierstrass model, displaying the coefficients in real time for direct use in further research. From prompt to executable research tool, the entire application was generated by the model alone.
Screenshot of Prof. Bartosz Naskręcki’s algebraic geometry app—visualization of intersecting quadratic surfaces and real-time Weierstrass equation computation
Brandon White, Co-Founder of Axiom Bio, offered a stark assessment: “If OpenAI sustains this momentum, the foundations of drug discovery will shift before year-end.”
05 Reasoning Efficiency: AI Optimized Its Own Infrastructure for the First Time
One easily overlooked detail of this release may be the most significant technical advancement.
GPT-5.5 is a larger, more powerful model—yet its per-token latency in production matches GPT-5.4’s. To sustain equivalent latency despite greater capability, OpenAI redesigned its entire inference system—and both Codex and GPT-5.5 directly participated in that optimization.
The Artificial Analysis Intelligence Index chart illustrates this clearly: The x-axis shows total output tokens (log scale); the y-axis shows composite intelligence score. GPT-5.5’s curve dominates GPT-5.4, Claude Opus 4.7, and Gemini 3.1 Pro Preview—not only in peak score, but crucially, in achieving those scores at lower token consumption. Greater capability at lower cost—that is the tangible manifestation of “efficiency gain.”
Artificial Analysis Intelligence Index line chart
Specifically, the team tackled load balancing: Previously, requests were split into fixed-size chunks to balance GPU workload—but static chunking isn’t optimal for all traffic patterns. Codex analyzed weeks of production traffic data and wrote a custom heuristic algorithm, boosting token generation speed by over 20%.
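OpenAI has not published the heuristic itself, so the following is only a sketch of the general idea of traffic-aware chunking: instead of a fixed chunk size, pick one from recently observed request lengths, so bursts of short requests get small chunks (lower latency) while long prompts get large chunks (higher throughput). The class name, thresholds, and median rule are all assumptions for illustration.

```python
# Hypothetical sketch of adaptive (traffic-aware) chunking, as opposed
# to the fixed-size chunking the article says was replaced. Not
# OpenAI's actual algorithm; all parameters are illustrative.
from collections import deque

class AdaptiveChunker:
    def __init__(self, min_chunk=128, max_chunk=4096, window=256):
        self.min_chunk = min_chunk
        self.max_chunk = max_chunk
        self.recent = deque(maxlen=window)  # recent request lengths

    def observe(self, request_tokens: int) -> None:
        """Record the length of an incoming request."""
        self.recent.append(request_tokens)

    def chunk_size(self) -> int:
        """Median of recent traffic, clamped to the allowed range."""
        if not self.recent:
            return self.min_chunk
        median = sorted(self.recent)[len(self.recent) // 2]
        return max(self.min_chunk, min(self.max_chunk, median))

    def split(self, request_tokens: int) -> list[int]:
        """Split one request into chunk lengths for GPU scheduling."""
        size = self.chunk_size()
        full, rem = divmod(request_tokens, size)
        return [size] * full + ([rem] if rem else [])
```

The reported approach differs in one key way: rather than a hand-tuned rule like this, Codex derived its heuristic from weeks of real production traffic.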
GPT-5.5 was co-designed, co-trained, and co-deployed with NVIDIA GB200 and GB300 NVL72 systems. In other words, this generation of model directly optimized the very infrastructure running it—not metaphorically, but literally: “AI improved the system running itself.”
06 Cybersecurity: Enhanced Capabilities, Tighter Controls
GPT-5.5 demonstrates clear gains in cybersecurity capability. On CyberGym, it scores 81.8%, up from GPT-5.4’s 79.0% and Claude Opus 4.7’s 73.1%. In internal Capture-the-Flag (CTF) challenges, GPT-5.5 scores 88.1%, versus GPT-5.4’s 83.7%.
CyberGym bar chart and CTF challenge scatter plot
OpenAI classifies GPT-5.5's cybersecurity and biological/chemical capabilities as "High" under its Preparedness Framework: still below "Critical," but a definitive step up from prior generations. It also acknowledges that the newly deployed, stricter risk classifiers "may initially feel inconvenient to some users," and commits to ongoing refinement.
To balance defense needs with access constraints, OpenAI launched the “Cybersecurity Trusted Access” program: Qualified security researchers and critical infrastructure defenders may apply for relaxed access permissions—enabling smoother use of advanced cybersecurity capabilities.
The underlying logic is clear: Capabilities like cybersecurity—and even biological or chemical ones—are subject to near-inevitable technological diffusion. Rather than attempting to restrict universal access, a better strategy is to ensure those doing defense get first access to the most advanced tools. In short, this isn’t about *whether* to open access—it’s about *who gets access first*.