Full post-mortem: How was Manus created?

2025.03.12

"Agent might be an 'alignment' issue, rather than a foundational model capability issue."

2025.03.12 - 09:20:54

Author: Wan Chen

The most inspiring startup story I encountered last year came from Zhang Luyu, founder of Dify.

I first met him at the 2023 "Xixi Dialogue" event. Among a constellation of prominent names, "Zhang Luyu" was an unremarkable one. When I saw him again in 2024, Dify had become another story entirely: a founder without an elite background had somehow created one of the world’s most successful open-source AI products amid widespread skepticism about business models.

What unfolded at this company over just one year—such as unexpectedly gaining traction in Japan’s traditionally conservative and hard-to-penetrate market—deepened my understanding of what “startup” really means. It's full of surprises, requires luck, and ultimately demands the ability to forge a path through constant change and disappointment.

Now, a similar narrative is unfolding around another highly watched entrepreneur—Xiao Hong of Manus.im and his team.

Four months ago, Xiao Hong shared a dilemma: “Our team excels at going from 0 to 1, with strong opportunity-grabbing capabilities. But once we move from 1 to N, our performance isn’t as strong.”

In his past ventures, his startups achieved relatively stable and substantial revenue, and his previous company was successfully acquired. In 2023, his new company “Butterfly Effect” launched Monica.im, a browser extension that carved out a unique position during the intense AI model race, becoming one of the fastest-growing AI applications with exceptional product experience. On the surface, he appeared to be a consistently successful entrepreneur—and all by age 32.

Yet, he didn’t feel much satisfaction. To Xiao Hong, the so-called “serial exit entrepreneur,” the thrill of repeatedly going from 0 to 1 felt like living inside a gilded cage—the rush of seizing opportunities was exhilarating, but it also brought anxiety: would he have to do it all over again?

In 2024, industry observers believed that AI assistants like Monica.im, especially those with memory features, would face stiff competition from heavyweights like Doubao, making growth far harder than in 2023. While Monica.im had nailed the 0-to-1 phase, it wasn’t clear whether it could scale from 1 to N.

His uncertainty stemmed from the realization that “the team must now tackle harder problems with higher ceilings”—exploring paths capable of bridging the gap from 1 to N.

Earlier, many assumed this “harder, higher-ceiling” project would be the long-rumored but still unreleased AI browser.

Now it’s clear: everyone guessed wrong.

This more challenging journey actually involved: abandoning an already-developed AI browser, searching for the next “ChatGPT moment” in AI products, identifying general-purpose agents as the goal, and building the newly launched Manus.im.

How innovative is Manus? What level can it achieve? These questions have instantly become explosive topics. But what remains truly worth examining is still the direction found amidst adversity—and how that direction emerged. Manus.im may not enable this team to master the 1-to-N transition, nor replicate Monica.im’s momentum. Yet much like the company’s name—“Butterfly Effect”—small actions and decisions can unintentionally shape the future in profound ways. “Connect the Dots”—tomorrow’s path is hidden within today’s experiences.

01 Manus’ Unique Product Experience: Lessons from Building an ‘AI Browser’

Since mid-to-late 2024, Butterfly Effect’s work on an AI browser had been an open secret in the industry. The official public debut, however, came with the runaway sensation: Manus.

If you’ve tried Manus firsthand or watched its demo videos, you’ll notice a key difference compared to chatbots or other agent-like apps: Manus can execute tasks asynchronously and in parallel.

When using apps like Doubao, Kimi, or Computer Use-like tools, asking a question requires waiting for a response. If you interrupt during processing, the ongoing task gets canceled—you’re locked into an A-B-A-B relay-style conversation.

With Manus.im, despite retaining the appearance of a chatbot interface, you can submit 20 different tasks simultaneously. You’re free to continue working—watching videos, writing documents, gaming—while Manus works independently. As soon as any task completes or encounters issues, Manus notifies you. If you spot flawed reasoning mid-task, you can jump in via the chatbox with additional prompts, and Manus will adjust accordingly with updated context.

The experience is asynchronous, parallel—like having a team of real human interns doing your bidding.
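The asynchronous, parallel interaction described above can be illustrated with a minimal sketch. This is not Manus’ implementation; `run_task`, `submit_all`, and the task names are hypothetical, and `asyncio.sleep` merely stands in for work happening elsewhere.

```python
import asyncio


async def run_task(name: str, seconds: float) -> str:
    """Hypothetical stand-in for one long-running agent task."""
    await asyncio.sleep(seconds)  # simulates work done in the background
    return f"{name}: done"


async def submit_all(tasks: dict[str, float]) -> list[str]:
    """Start every task at once and collect a notification as each one finishes."""
    pending = [asyncio.create_task(run_task(n, s)) for n, s in tasks.items()]
    notifications = []
    for finished in asyncio.as_completed(pending):
        # Results arrive in completion order, not submission order.
        notifications.append(await finished)
    return notifications


def demo() -> list[str]:
    return asyncio.run(submit_all({"report": 0.05, "itinerary": 0.01, "slides": 0.03}))
```

The caller is never blocked on a single task: all three run concurrently, and the shortest one reports back first even though it was not submitted first.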

In reality, Manus’ asynchronous architecture stems directly from lessons learned while developing their previous undisclosed product: the AI browser. This insight was also why the team, after significant investment, decided in October 2024 to halt development on the browser.

The Browser Company announced on October 25, 2024, it would cease new feature development for Arc browser, redirecting resources toward a new browser called Dia, aiming to build a simpler, more user-friendly AI browser. | Source: Arc official website

“In an AI browser, the AI constantly interrupts the user.” Designed for single-user scenarios, when the AI takes over, you’re sidelined. Once the AI starts working, you can only watch passively. Watching the AI hijack your mouse and computer, you dare not reclaim control, fearing a keystroke or mouse movement might crash the entire process and force a restart.

This led the team to two conclusions:

Directly using AI to operate your local computer (Computer Use) is impractical in the short term.
AI should use a browser—but not yours. It needs its own browser, preferably in the cloud, which then returns results to you.

In an interview with Tencent Tech’s Zhang Xiaojun, Xiao Hong mentioned that when reviewing product evolutions—from Jasper to ChatGPT, Monica, Cursor, to Devin—the team realized “human programmer” Devin closely embodied this asynchronous architecture.

Windsurf, by contrast, sometimes asks users to confirm installing libraries on their machine, or pauses execution for a “yes/no” input because an action risks damaging the system or causing conflicts. That essentially passes responsibility back to the user. The ideal agent shouldn’t interrupt.

Thus, the Manus team concluded: “There should be a virtual computer in the cloud for the chatbot, where it runs code and browses websites. Since it’s a virtual server, if it breaks, no problem—just spin up a new one. Even better, release that server instance once the current task finishes.”
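The “disposable virtual computer” idea in this quote can be sketched as follows. This is an illustration only: `Sandbox`, `fresh_sandbox`, and `run_with_retry` are hypothetical names, and no real cloud VM is created.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
import itertools

_ids = itertools.count(1)


@dataclass
class Sandbox:
    """Hypothetical stand-in for a cloud virtual machine."""
    sandbox_id: int = field(default_factory=lambda: next(_ids))
    released: bool = False

    def run(self, command: str) -> str:
        if command == "crash":
            raise RuntimeError("sandbox broke")
        return f"[vm {self.sandbox_id}] ran {command}"


@contextmanager
def fresh_sandbox():
    """Hand out a new sandbox and always release it when the task ends."""
    box = Sandbox()
    try:
        yield box
    finally:
        box.released = True


def run_with_retry(command_attempts: list[str]) -> str:
    """Try each attempt in a brand-new sandbox; a broken one is simply discarded."""
    last_error = None
    for command in command_attempts:
        try:
            with fresh_sandbox() as box:
                return box.run(command)
        except RuntimeError as err:
            last_error = err
    raise RuntimeError("all attempts failed") from last_error
```

Because nothing runs on the user’s own machine, a failure costs only one throwaway instance, which is why such an agent would not need to stop and ask for confirmation.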

Notably, while Devin targeted vertical, hardcore engineering use cases, the Manus team chose a general-purpose, consumer-grade AI assistant—with both web and app versions. It’s a universal AI assistant that follows instructions, invokes tools, and completes diverse tasks across work and life, eventually delivering results at a consumer-affordable price point.

02 Less Structure, More Intelligence

With a clear vision in place, the next step was execution. How did Manus achieve this?

According to Zhang Tao, Manus’ product partner, realizing this vision required equipping large language models with a virtual computer, granting them system-level permissions (access to private APIs like code repositories and specialized data query sites), plus targeted training.

This setup allows AI to autonomously use the virtual computer to open browsers, trigger tool usage, observe real-world impacts of its actions, reflect, act again, observe again—and so on. This iterative loop enables AI to explore and complete complex tasks. Over time, Manus becomes increasingly attuned to your preferences through interaction. Eventually, even without explicit instructions, it can infer intent based on accumulated knowledge from past tasks—anticipating your needs.
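The act, observe, reflect loop described above can be sketched with a toy environment. Everything here is hypothetical (`GuessEnv`, `bisect_policy`, `run_agent`); a number-guessing game stands in for the virtual computer, and a bisection rule stands in for the model’s reflection.

```python
class GuessEnv:
    """Toy environment: the agent must find a secret number."""

    def __init__(self, secret: int):
        self.secret = secret

    def reset(self) -> str:
        return "start"

    def step(self, guess: int):
        if guess == self.secret:
            return "correct", True
        return ("too low" if guess < self.secret else "too high"), False


def bisect_policy(observation, history, low=0, high=100) -> int:
    """Reflect on every past (action, feedback) pair before choosing the next action."""
    for guess, feedback in history:
        if feedback == "too low":
            low = max(low, guess + 1)
        elif feedback == "too high":
            high = min(high, guess - 1)
    return (low + high) // 2


def run_agent(policy, environment, max_steps: int = 20):
    history = []
    observation = environment.reset()
    for _ in range(max_steps):
        action = policy(observation, history)          # decide
        observation, done = environment.step(action)   # act, then observe
        history.append((action, observation))
        if done:
            break
    return history
```

The loop itself prescribes no workflow: it only feeds each observation back to the policy, which is the sense in which the structure stays minimal.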

Li Bojie, a Huawei “Genius Youth” recruit and founder of Logenic AI, believes Manus has a standout quality: solving problems the way an elite hacker-programmer would. | Image source: WeChat screenshot

The core philosophy behind Manus crystallized during the team’s development process: Less Structure, More Intelligence (fewer constraints, greater autonomy).

This principle sparked repeated “A-Ha, Wait!” moments for the team. For example, in January this year:

The team asked Manus to solve a question from the GAIA benchmark: “In a National Geographic-style YouTube video showing penguins walking in and out of frame, determine the maximum number of distinct penguin species visible simultaneously in a single frame.”

Then, something magical happened.

Manus opened the video link and pressed 'K'—its first action. Then it systematically took screenshots, noting which species appeared in each frame. After analysis, it determined the peak count was three species. To verify, its next action was pressing '3'… Finally, it confirmed the answer: 3.

As creators, they should have known Manus’ limits—but instead, they were stunned. Not only did Manus get the answer right, but humans who’ve used YouTube daily might not even know what keys like 'K' or '3' do.

The team followed along: 'K' is the keyboard shortcut to pause playback, allowing precise frame-by-frame capture. '3' jumps to 30% of the progress bar—an exact timestamp shortcut—enabling pinpoint navigation to validate findings.
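As a small worked example of the digit shortcut: on YouTube, the number keys 0 through 9 seek to 0% through 90% of the video. The helper below is a hypothetical illustration of that arithmetic, not part of any YouTube API.

```python
def digit_key_to_seconds(key: str, duration_seconds: float) -> float:
    """Return the timestamp a digit key seeks to (key N jumps to N * 10 percent)."""
    if len(key) != 1 or key not in "0123456789":
        raise ValueError("expected a single digit key")
    return duration_seconds * int(key) / 10
```

For a 200-second video, pressing '3' would land at the 60-second mark.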

“This process differs fundamentally from traditional chatbots. First, it sees the actual video frames—not subtitles. Second, we discovered it used native YouTube shortcuts. We were shocked—it solved the problem correctly,” Xiao Hong said in a prior Tencent Tech interview.

All of a sudden, they realized Manus wasn’t just better than humans at programming—it possessed unimaginably deep knowledge of everyday web and app interfaces. As an omniscient AI, it knows every trick and tool across platforms and selects optimal solutions.

Once again, the team experienced “Less Structure, More Intelligence”—minimizing artificial constraints and letting AI evolve and operate freely, rather than prescribing rigid workflows.

At the very bottom of the Manus homepage, almost hidden, lies Manus’ most important insight: 'Less Structure, More Intelligence'. | Screenshot source: Manus

This is how Peak, co-founder and chief scientist of Butterfly Effect, explained and expanded upon the foundational principle behind Manus on launch day:

When your data is high-quality, your model intelligent enough, your architecture flexible, and your engineering solid, capabilities like Computer Use, Deep Research, and Coding Agent stop being discrete product features—they emerge naturally.

Returning to first principles reshapes our view of product design:
• An AI browser isn’t adding AI to a browser—it’s building a browser for AI;
• AI search isn’t retrieving indexed results and summarizing—it’s empowering AI to access information with user-level permissions;
• GUI operation isn’t seizing control of user devices—it’s giving AI its own virtual machine;
• Code generation isn’t the end goal—it’s a universal medium for solving problems;
• Website creation isn’t about scaffolding frameworks—it’s about generating meaningful content;
• Attention isn’t all you need—freeing human attention redefines DAU;
• ····

Through repeated discovery and application of “Less Structure, More Intelligence,” Manus delivered results exceeding expectations—including surpassing OpenAI Deep Research’s pass@1 score on the GAIA benchmark under cons@64 conditions. Internally, Manus also covered 76% of dedicated agent use cases among Y Combinator W25 startups.
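For readers unfamiliar with the two metrics named here, the sketch below shows how pass@1 and cons@k are commonly computed: pass@1 scores a single sampled answer per question, while cons@k scores the majority answer among k samples. The function names and sample data are illustrative, and the source does not describe the exact evaluation protocol used for Manus.

```python
from collections import Counter


def pass_at_1(first_answers: list[str], truths: list[str]) -> float:
    """Fraction of questions where the single sampled answer is correct."""
    return sum(a == t for a, t in zip(first_answers, truths)) / len(truths)


def cons_at_k(samples_per_question: list[list[str]], truths: list[str]) -> float:
    """Fraction of questions where the majority answer among k samples is correct."""
    correct = 0
    for samples, truth in zip(samples_per_question, truths):
        majority, _ = Counter(samples).most_common(1)[0]
        correct += majority == truth
    return correct / len(truths)


# Illustrative data only: two questions, three samples each.
truths = ["3", "7"]
samples = [["2", "3", "3"], ["7", "7", "1"]]
first_answers = [s[0] for s in samples]
```

With this toy data, `pass_at_1(first_answers, truths)` is 0.5 and `cons_at_k(samples, truths)` is 1.0, which shows why a consensus score is usually the easier of the two to raise.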

03 “Agent Challenges May Be About Alignment, Not Base Model Capability”

These insights are now sparking broader discussions:

Clement Delangue, Hugging Face co-founder and CEO, wrote on X that Peak’s observation deserves attention: agent capability isn’t bottlenecked by base models but by alignment, similar to the difference between GPT-3 and InstructGPT (ChatGPT). Many open-source base models are simply trained to “answer fully in one turn regardless of complexity,” optimized for chatbot use. But slight post-training adjustments tailored to agent workflows can make a dramatic difference. | Screenshot source: X

Manus does not adopt MCP (Model Context Protocol), instead enabling AI to write code and call APIs directly to handle diverse long-tail tasks. | Screenshot source: X

In recent days of discussion around Manus, the most frequently asked question has been: Can a general-purpose AI agent succeed—and where are its limits?

Peak believes human interaction with the world follows standard patterns: eyes, hands, ears. If the action space is well-defined, an agent should seamlessly replace humans in existing workflows.

If humans can use various tools to perform deep, domain-specific operations, then given sufficient knowledge, proper training, and a robust interface to interact with the world, an agent should be able to work similarly—even operating SaaS products. For instance, one housing search demo shown on Manus.im involves the AI using a real estate-specific SaaS platform.

He emphasizes the boundary should be defined by the tools an agent can use, not the user segment it serves. Manus isn't simulating a specific role like developer or product manager; it's mimicking a capable person—a smart intern.

Manus’ multi-agent system refers to the separation between planning and execution.

For the executor, Manus leverages Claude for its leading-edge capabilities in programming, long-horizon planning, and step-by-step problem-solving, while also fine-tuning Qwen series models.

Yesterday, Manus announced a strategic partnership with Alibaba’s Tongyi Qianwen, aiming to deliver full functionality on domestic models and computing platforms. | Image source: Manus

The planner component is where Manus invests heavily.

Current off-the-shelf APIs and models are fundamentally aligned for chatbot scenarios: no matter how complex the user query, the optimization goal is to resolve it within a single response. This contradicts the iterative, multi-step nature of agent planning.

Using existing models directly in agent contexts creates misalignment—the model rushes to produce a vague, bullet-point-style output in one go, rather than thoughtfully planning steps.

“The alignment method must differ. Our team believes specialized data and alignment training are needed,” said Xiao Hong.
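The separation between planning and execution discussed in this section can be sketched as two independent functions. This is a toy under stated assumptions: `plan` simply splits the request on the word "then", whereas a real planner would be a model trained for multi-step decomposition, and `execute` only records a placeholder result.

```python
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    done: bool = False
    result: str = ""


def plan(task: str) -> list[Step]:
    """Hypothetical planner: break a request into ordered steps instead of answering at once."""
    return [Step(part.strip()) for part in task.split(" then ") if part.strip()]


def execute(step: Step) -> Step:
    """Hypothetical executor: carry out one step and record the outcome."""
    step.result = f"completed: {step.description}"
    step.done = True
    return step


def run(task: str) -> list[Step]:
    steps = plan(task)
    for step in steps:
        execute(step)
    return steps
```

Calling `run("search listings then compare prices then write summary")` yields three completed steps. The contrast with a chatbot-aligned model is that nothing here tries to produce the final answer in a single response.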

In October last year, Peak documented on Zhihu his failed attempt to reproduce OpenAI o1 via an open-source project called Steiner—an effort that was essentially early R&D into Manus’ step-by-step planning module.

Overall, Manus simulates a competent human worker—this is the team’s definition of a general-purpose AI assistant. Its boundaries remain under exploration, requiring more real-world usage data.

Prior to Manus’ launch, in a Tencent Tech interview, Xiao Hong shared preliminary thoughts on its generality: “A core challenge—or a key product management duty—is managing user expectations. Assuming it can do everything—like telling me how to earn $1 million—is unreasonable. But if we provide concrete examples, align expectations, and guide usage, people will find it intuitive.”

04 “Shells Have Their Purpose”—The Team That Understands Shells Best

In the early hours of February 27, Manus’ product partner Zhang Tao and chief scientist Ji Yichao (Peak) both shed tears upon seeing Manus’ benchmark results. Manus surpassed OpenAI’s Deep Research on the GAIA Benchmark, and did so at a cost of about $2 per task, roughly one-tenth of OpenAI’s.

Image source: Manus.im

A small team of dozens, entering the agent race just as the industry reached consensus, became one of the first to ship a general-purpose agent—with distinctive strengths in product engineering and front-end UX.

Positive reinforcement from achieving something real outweighs everything. For a startup team, there’s no better motivator. But how did Manus come to be? Why this team?

“Today’s models are already capable of completing complex, multi-step tasks. There just aren’t products showcasing this—so people don’t realize it,” Xiao Hong observed in a prior Tencent Tech interview—an insight that explains the breakthrough.

“Moreover, few teams have the chance to build agent products. It requires rare composite skills: experience with chatbots, AI coding tools, browser technologies (since agents need to invoke browsers), and a sharp sense of LLM capabilities, knowing exactly where models stand and where they’re headed. Few companies possess all these skills; those that do are often too busy with core businesses. We happened to have teammates available to pursue this,” he added.

“Happened to.”

Finding the right moment when model capabilities matured enough for agents—without waiting for an end-to-end Operator-like model;
Recognizing the core issue was alignment;
Having built both chatbot extensions and AI browsers;
Maintaining acute awareness of LLMs through continuous “shell-building” on top of large models;

The Butterfly Effect team assembled all necessary ingredients to build a general-purpose agent at this moment—resulting in one of the most polished agent products in the industry.

When asked about the pivotal moment deciding to build Manus, Peak provided deeper context: “Startup pivots are never clean—they’re continuous, with no clear boundaries.”

“While building one product, we constantly monitor external developments.” At the time, several things converged: while building the browser, they experimented with on-device models, but realized browser use cases were extremely broad. During development, they noticed base models were improving rapidly—so fast that the gap between models and agents might just be an alignment issue, even as others believed LLMs were plateauing.

Externally, trends were shifting. Early 2024 saw Cursor gain popularity, followed by Windsurf and Devin—each representing progressive advancements in agentization within programming. Cursor acted as a copilot boosting coding efficiency; Windsurf introduced automation workflows enhancing local machine capabilities; Devin pushed automation further.

VC trends aligned too: over the past two years, YC invested in two types of companies—one being cloud-based browsers like Browserbase; the other lightweight AI sandbox VMs like e2b.

This signaled that both models and the underlying infrastructure were maturing rapidly. Seeing growing market acceptance of such products, they felt this was a direction worth going all-in on. It was a smooth, gradual realization, made possible by reusing the Chromium-based infrastructure developed during the browser effort. This foundation gave them the confidence to build a browser in the cloud.

In summary, acute sensitivity to user needs and model capabilities forged through “shell-building,” combined with accumulated experience, made Manus possible. Monica required extensive model fine-tuning; AI browser experiments reinforced the critical lesson of “less structure, more intelligence”; they recognized models were ready for agents—the missing piece was alignment. Then came Manus’ rapid 3-month evolution.

Previously, Butterfly Effect faced skepticism about the value of “shell” products. Without developing their own LLMs, they built Monica by integrating existing models, combining chat, search, reading, writing, translation, and numerous API-connected task scenarios—reaching tens of millions of users by year-end.

Now, as Doubao, Quark, and Yuanbao aggressively promote their own Monica-like offerings, and a small team leverages existing tech to launch the first general consumer-grade agent, it’s time to rethink what a ‘shell’ truly is.

What exactly is “shell-building”? What is a “shell”?

To Xiao Hong, all breakthroughs originate from models—innovation is model-led. A shell exists to present model-level innovations in a user-perceivable way, packaging cutting-edge model capabilities into forms users can intuitively grasp.

By this definition, the DeepSeek App (including chain-of-thought display) is the shell for DeepSeek-R1; Cursor is the shell for Claude 3.5 Sonnet; Perplexity is GPT-4’s shell; ChatGPT is InstructGPT’s shell.

As model capabilities evolve rapidly, “the shell” must evolve too. After each leap in model capability, it’s often not the original maker—but a third party—that best translates its value into user experience. Just as Cursor showcased the user-facing value of Claude 3.5 Sonnet.

March 5 marked the second anniversary of Monica.im’s launch. The answer to why this small team delivered a product experience surpassing Deep Research and OpenAI Operator lies precisely in their understanding and mastery of shells.

How do you build the best possible shell for a new agent-capable model?

As a builder of Manus, Zhang Tao says: “Looking behind the scenes at the full architecture, we see countless unfinished components—each one a potential game-changer, each shaping the final product in fundamental ways.”

To the team, the biggest advantage is innovation speed. Whether in applications or models, the field is approaching saturation. The ultimate competitive edge? Move faster. Even if data flywheels and network effects remain unproven, speed wins.

“In a completely new domain, everything is undefined, everything unknown. Speed of innovation matters most—exploring widely, failing fast, and quickly finding the right path.” The Manus team’s management philosophy, organizational structure, and industrial processes are exceptionally agile. When new opportunities arise, they can mobilize company-wide resources efficiently, make ultra-fast decisions, and adapt rapidly based on feedback—even from mistakes.

From left to right: Peak (Chief Scientist), Xiao Hong (CEO), Zhang Tao (Product Partner) of Butterfly Effect | Image source: Internet

Regarding expectations for Manus, Xiao Hong says: “Even if there’s only a narrow window, it’s worth trying.” Over the past year, his thinking has shifted dramatically. Today, he believes: “When you realize you’re ahead, be bolder—radically bold. Looking back, I think we weren’t bold enough with Monica in 2023. If you know you’re innovating and leading, you should go all-in.”

We don’t yet know whether Manus will grant Xiao Hong and his team the 1-to-N breakthrough they seek. But this team—the ones who understand shells best—believes in creating with mind and hand united, and in the butterfly effect unleashed by creation. Manus derives from MIT’s motto: Mens et Manus—mind and hand. True knowledge isn’t just theory—it’s action that impacts the real world.

Going forward, as more of Manus’ underlying technology opens up, an even wider butterfly effect will unfold.

Author: 极客公园 (GeekPark)