
DeepSeek's "server busy" error is driving everyone crazy—what's really going on behind the scenes?
TechFlow Selected
Stuck on the cards.

Image source: Generated by Wujie AI
Users around the world are frustrated by DeepSeek’s frequent response of "Server busy, please try again later."
Prior to this, DeepSeek was relatively unknown to the general public but rose to prominence on December 26, 2024, with the launch of its language model V3, designed to compete with GPT-4o. On January 20, DeepSeek released R1, a language model targeting OpenAI's o1. The high quality of answers generated through its "deep thinking" mode, along with innovative approaches suggesting a significant reduction in training costs, propelled the company and its application into the mainstream spotlight. Since then, DeepSeek R1 has consistently faced congestion, with its web search functionality intermittently failing and the deep thinking mode frequently displaying "server busy," causing considerable inconvenience for many users.
About ten days ago, DeepSeek began experiencing server outages. Around noon on January 27, DeepSeek’s official website repeatedly displayed the message “DeepSeek webpage/API unavailable.” That same weekend, DeepSeek had just become the most downloaded iPhone app in the U.S., overtaking ChatGPT on the download charts.

On February 5, just 26 days after launching its mobile app, DeepSeek surpassed 40 million daily active users. For comparison, ChatGPT’s mobile app had 54.95 million daily active users, meaning DeepSeek had reached 74.3% of ChatGPT’s level. Almost in lockstep with this steep growth curve, complaints about server congestion poured in. Users worldwide began hitting failures after asking just a few questions. Various workarounds emerged: alternative sites hosting DeepSeek, services launched by major cloud providers, chipmakers, and infrastructure companies, and numerous personal deployment guides circulating online. Yet user frustration remained unabated: although nearly every major vendor worldwide claimed to support DeepSeek deployment, users kept reporting unstable service.
What exactly is happening behind the scenes?
1. People used to ChatGPT can't stand an inaccessible DeepSeek
The dissatisfaction with "DeepSeek server busy" stems from prior experiences with top-tier AI applications like ChatGPT, which rarely suffer from lag or downtime.
Since OpenAI launched its service, ChatGPT has experienced only a few P0-level (most severe) outages. Overall, it has proven relatively reliable, striking a balance between innovation and stability, gradually becoming a critical component akin to traditional cloud services.

ChatGPT does not experience widespread outages frequently.
ChatGPT’s inference process is relatively stable and consists of two phases: encoding and decoding. During encoding (the prefill step), the input text is turned into tokens and then into vectors carrying semantic information. During decoding, the model takes everything generated so far as context and uses the Transformer to predict the next token, repeating until a complete response is formed. The model itself is a decoder-only architecture, so the decoding stage emits tokens (the smallest unit of text a large model processes) one at a time. Every question you ask ChatGPT kicks off one such inference cycle.
For example, if you ask ChatGPT, "How are you feeling today?" it first encodes the sentence, generating attention representations at each layer. Based on these attention representations from all previous tokens, it predicts the first output token "I." Then, during decoding, "I" is appended to "How are you feeling today?" forming "How are you feeling today? I," creating new attention representations. It then predicts the next token: "feel," and continues this loop until producing the full sentence: "How are you feeling today? I feel great."
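To make the loop concrete, here is a minimal sketch of prefill-then-decode using an open GPT-2 checkpoint via the Hugging Face transformers library. The checkpoint, the greedy (argmax) selection, and the 20-token limit are stand-in assumptions; ChatGPT’s actual models and serving stack are not public.

```python
# Minimal sketch of the prefill-then-decode loop described above.
# Assumptions: an open GPT-2 checkpoint stands in for ChatGPT's model,
# and greedy (argmax) selection stands in for its real decoding strategy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "How are you feeling today?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # encoding/prefill step

generated = input_ids
past = None
with torch.no_grad():
    for _ in range(20):  # decode at most 20 new tokens
        # After the first step only the newest token is fed in, because the
        # attention states of earlier tokens are cached in past_key_values.
        inputs = generated if past is None else generated[:, -1:]
        out = model(input_ids=inputs, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        generated = torch.cat([generated, next_token], dim=-1)          # append it and loop again
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The thing to notice is that each new token requires another pass through the model, so every user question turns into a burst of sequential GPU work rather than a single lookup.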
Kubernetes, the container orchestration tool, acts as ChatGPT’s “behind-the-scenes conductor,” scheduling and allocating server resources. When user traffic exceeds what the Kubernetes control plane can handle, it can bring down the entire ChatGPT platform.
Although ChatGPT has suffered relatively few total outages, this reliability is backed by immense underlying resources. Maintaining such stability requires powerful computing capabilities—a fact often overlooked.
Generally, inference processes involve smaller data scales than training, so their computational demands are lower. Industry experts estimate that during normal large model inference, memory usage is dominated by model parameter weights, accounting for over 80%. In reality, the default model sizes embedded within ChatGPT are smaller than DeepSeek-R1’s 671B parameters. Combined with ChatGPT’s significantly greater GPU computing power compared to DeepSeek, it naturally delivers more stable performance than DS-R1.
Both DeepSeek-V3 and R1 are 671B-parameter models. Once such a model is deployed, serving it is an inference workload, and the compute reserved for it has to scale with user volume: supporting 100 million users, for instance, requires GPU capacity provisioned for that scale. This requirement is massive and entirely separate from the compute used for training. From the available information, DS clearly lacks sufficient GPU and compute reserves on the inference side, hence the frequent lag.
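To get a feel for the scale involved, the hedged back-of-envelope sketch below estimates how many accelerators are needed just to hold a 671B-parameter model in memory. The precision, overhead factor, and per-GPU memory are illustrative assumptions, not DeepSeek’s actual serving configuration.

```python
# Back-of-envelope: GPUs needed just to hold 671B parameters in memory.
# Assumptions: FP8 weights (1 byte per parameter), 80 GB of usable memory
# per GPU, and ~20% extra for KV cache, activations, and runtime overhead.
PARAMS = 671e9
BYTES_PER_PARAM = 1      # FP8; roughly double this for BF16/FP16
OVERHEAD = 1.2           # KV cache + activations + framework overhead (rough guess)
GPU_MEM_GB = 80          # an 80 GB-class accelerator

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
total_gb = weights_gb * OVERHEAD
gpus_per_replica = -(-total_gb // GPU_MEM_GB)  # ceiling division

print(f"weights: ~{weights_gb:.0f} GB, with overhead: ~{total_gb:.0f} GB")
print(f"GPUs needed per model replica: ~{gpus_per_replica:.0f}")
# Every extra replica needed to serve more concurrent users multiplies
# this GPU count again, before a single token has been generated.
```

Under BF16 the footprint roughly doubles, and throughput only grows by adding whole multi-GPU replicas, which is why inference reserves have to scale with user volume.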
This contrast makes users accustomed to ChatGPT’s smooth experience uncomfortable, especially given their growing interest in R1.
2. Lag, lag, and more lag
Moreover, upon closer examination, the situations faced by OpenAI and DeepSeek differ significantly.
The former benefits from Microsoft’s backing; as OpenAI’s exclusive platform, Microsoft Azure Cloud hosts ChatGPT, DALL-E 2 image generator, and GitHub Copilot automated coding tools. This combination has become a classic cloud+AI paradigm, rapidly spreading as an industry standard. In contrast, despite being a startup, DeepSeek mostly relies on self-built data centers, similar to Google, rather than third-party cloud providers. After reviewing public information, Silicon Star found no actual cooperation between DeepSeek and any cloud or chip vendors at any level (although cloud providers announced during the Spring Festival that they were running DeepSeek models on their platforms, there was no meaningful collaboration).
In addition, DeepSeek is facing unprecedented user growth, leaving it with far less preparation time for emergency scenarios compared to ChatGPT.
DeepSeek’s strong performance results from holistic optimizations at the hardware and system levels. Its parent company, High-Flyer Quant, invested 200 million yuan back in 2019 to build the Firefly No.1 supercomputing cluster. By 2022, it quietly stockpiled tens of thousands of A100 GPUs. To enable more efficient parallel training, DeepSeek developed its own HAI LLM training framework. Industry observers believe the Firefly cluster likely incorporates several thousand to tens of thousands of high-performance GPUs (such as NVIDIA A100/H100 or domestic chips), providing robust parallel computing power. Currently, the Firefly cluster supports training for models including DeepSeek-R1 and DeepSeek-MoE, which perform close to GPT-4 levels in complex tasks like math and coding.
The Firefly cluster represents DeepSeek’s exploration of novel architectures and methodologies. It has led outsiders to believe that through such technological innovations, DS has reduced training costs, enabling them to train R1—an AI model comparable in performance to top-tier Western models—using only a fraction of the computational power. SemiAnalysis estimates suggest DeepSeek possesses vast computational reserves: stacking up to 60,000 NVIDIA GPU cards, including 10,000 A100s, 10,000 H100s, 10,000 "China-specific" H800s, and 30,000 "China-specific" H20s.
This may imply sufficient GPUs for R1. But in practice R1 is a reasoning model, positioned against OpenAI’s o1, and reasoning models require substantially more compute to be deployed on the response side. Whether the compute DS saved on training outweighs the extra compute its inference now demands is hard to say.
Notably, both DeepSeek-V3 and DeepSeek-R1 are large language models, but operate differently. DeepSeek-V3 is an instruction-following model, similar to ChatGPT, responding to prompts with generated text. DeepSeek-R1, however, is a reasoning model. When users pose questions to R1, it first undergoes extensive internal reasoning before generating final answers. The initial tokens produced by R1 consist largely of chain-of-thought reasoning steps—the model explains and breaks down the problem before delivering the answer, with all intermediate reasoning rapidly generated as tokens.
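The difference shows up even at the API level. The sketch below follows DeepSeek’s published OpenAI-compatible interface; the base URL, the deepseek-reasoner model name, and the reasoning_content field are taken from DeepSeek’s public documentation at the time of writing and should be treated as assumptions that may change. A single R1 request returns both the chain-of-thought tokens and the final answer, and all of it has to be generated token by token.

```python
# Sketch: querying R1 through DeepSeek's OpenAI-compatible API.
# Assumptions: the base_url, the "deepseek-reasoner" model name, and the
# reasoning_content field follow DeepSeek's public docs and may change.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",   # placeholder
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-reasoner",         # the R1 reasoning model
    messages=[{"role": "user", "content": "How many primes are there below 50?"}],
)

msg = resp.choices[0].message
# R1 first emits its chain of thought, then the final answer;
# both are generated token by token and both count as output.
print("reasoning:", msg.reasoning_content)
print("answer:", msg.content)
```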
According to Wen Tingcan, Vice President at Yaotu Capital, DeepSeek’s massive computational reserve refers specifically to the training phase, where the team can plan ahead and anticipate needs, so shortages are unlikely. Inference compute, by contrast, is much harder to predict because it depends on user scale and usage volume, making it far more elastic. “Inference compute grows according to certain patterns, but once DeepSeek became a phenomenon-level product, users and usage exploded in a short time, and inference demand surged beyond control. Hence the lag,” he said.
Gui Zang, a model product designer and independent developer active on Jike, agrees that an insufficient GPU supply is the primary cause of DeepSeek’s lag. In his view, for what is currently the most-downloaded mobile app across 140 markets worldwide, the GPUs DeepSeek has on hand simply cannot sustain the load, “and even adding new GPUs won’t help immediately, because provisioning new cloud capacity takes time.”
“There is a fair market price for running NVIDIA A100, H100, and other chips per hour. Judged by the inference cost of the tokens it outputs, DeepSeek is over 90% cheaper than OpenAI’s comparable o1 model, which roughly matches most people’s back-of-envelope math. So the MoE architecture itself isn’t the core issue. Rather, the number of GPUs DS owns caps the maximum number of tokens it can produce per minute. Even if more GPUs are shifted from pre-training research to inference serving, that ceiling is still there,” said Chen Yunfei, developer of the AI-native app Xiaomao Bu Guangdeng, who holds a similar view.
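Chen’s point about a hard token ceiling can be made concrete with a hedged back-of-envelope calculation. Every number below (GPUs dedicated to inference, per-GPU decode throughput, tokens per answer) is an illustrative assumption rather than DeepSeek’s real figures.

```python
# Back-of-envelope: how a fixed GPU fleet caps tokens per minute and thus
# the number of answers served. Every input is an illustrative assumption.
INFERENCE_GPUS = 10_000        # GPUs set aside for serving (assumed)
TOKENS_PER_GPU_PER_SEC = 50    # sustained decode throughput per GPU (assumed)
TOKENS_PER_ANSWER = 1_500      # R1 answers include chain-of-thought, so they run long (assumed)

tokens_per_minute = INFERENCE_GPUS * TOKENS_PER_GPU_PER_SEC * 60
answers_per_minute = tokens_per_minute / TOKENS_PER_ANSWER

print(f"fleet-wide ceiling: {tokens_per_minute:,.0f} tokens/min")
print(f"≈ {answers_per_minute:,.0f} full answers/min")
# With tens of millions of daily active users, peak demand can easily exceed
# this ceiling, and the overflow shows up as "server busy" errors.
```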
Some industry insiders told Silicon Star that the root of DeepSeek’s lag lies in poorly managed private cloud infrastructure.
Another contributing factor to R1’s lag is cyberattacks. On January 30, media learned from cybersecurity firm Qi An Xin that attacks against DeepSeek’s online services suddenly intensified, increasing over a hundredfold compared to January 28. Qi An Xin Xlab observed at least two botnets involved in the attack.
However, there appears to be an obvious solution to R1’s own service lag: third-party service provision. This was precisely what we witnessed during the Spring Festival—the liveliest scene being various vendors rushing to deploy services to meet demand for DeepSeek.
On January 31, NVIDIA announced that DeepSeek-R1 could now be used via NVIDIA NIM, shortly after NVIDIA lost nearly $600 billion in market value overnight due to DeepSeek-related impacts. On the same day, Amazon Web Services (AWS) users could deploy DeepSeek’s latest R1 foundation model on its AI platforms, Amazon Bedrock and Amazon SageMaker AI. Following this, emerging AI apps like Perplexity and Cursor also integrated DeepSeek en masse. Microsoft beat Amazon and NVIDIA to the punch, deploying DeepSeek-R1 on its Azure cloud service and GitHub.
Starting February 1, the fourth day of the Lunar New Year, Huawei Cloud, Alibaba Cloud, ByteDance’s Volcano Engine, and Tencent Cloud joined in, generally offering deployment services for the full range of DeepSeek model sizes. They were followed by AI chipmakers such as Biren Technology, Hanhai Semiconductor, Ascend, and MXCHIP, which claimed compatibility with either the original DeepSeek models or distilled, smaller variants. On the software side, companies like Yonyou and Kingdee integrated DeepSeek models into certain products to enhance functionality. Finally, device makers including Lenovo, Huawei, and Honor began building DeepSeek models into select devices for on-device personal assistants and smart vehicle cockpits.
To date, DeepSeek has attracted a vast network of partners based on its value proposition, encompassing international and domestic cloud providers, telecom operators, securities firms, and national-level platforms like the National Supercomputing Internet Platform. Because DeepSeek-R1 is a fully open-source model, all service providers integrating it benefit directly. While this dramatically amplified DeepSeek’s visibility, it also exacerbated frequent lag issues. Both service providers and DeepSeek itself have increasingly struggled under surging user traffic, unable to identify a key solution for stable usage.
Given that the original DeepSeek V3 and R1 models both weigh in at as many as 671 billion parameters, they are best suited to running in the cloud. Cloud providers inherently have more abundant compute and inference capacity, and they launched DeepSeek deployment services to lower the barrier for enterprise adoption. By hosting the DeepSeek models and exposing them through their own APIs, these providers were expected to deliver a better experience than DS’s own API.
In reality, however, the poor experience of running the DeepSeek-R1 model persists across every vendor’s service. Outsiders assume the service providers have no shortage of GPUs, but developers report that these hosted versions are roughly as unstable as R1’s own service, mainly because each provider allocates only a limited number of GPUs to R1 inference.

“R1 maintains high popularity, and service providers must balance multiple models they host. GPUs available for R1 are extremely limited. Given R1’s high demand, any provider launching R1 services—even at relatively low prices—gets overwhelmed instantly,” explained Gui Zang, model product designer and independent developer, to Silicon Star.
Model deployment optimization spans a broad domain involving numerous stages—from post-training to actual hardware deployment—yet for DeepSeek’s lag issue, the root cause may be simpler: an overly large model and inadequate optimization prior to launch.
Prior to launching a popular large model, teams face multifaceted challenges involving technology, engineering, and business operations—such as consistency between training and production environment data, data latency affecting real-time inference performance, excessive inference latency and resource consumption, insufficient model generalization, and engineering aspects like service stability, API integration, and system interoperability.
Most trending large models place heavy emphasis on inference optimization before launch due to concerns around computation time and memory usage. Long inference latencies degrade user experience and fail to meet responsiveness requirements—resulting in lag. High parameter counts consume excessive VRAM, sometimes exceeding single-GPU capacity, also leading to lag.
Wen Tingcan explained to Silicon Star that the difficulties service providers run into with R1 stem from DS’s unusual model structure: its sheer size combined with a MoE (Mixture-of-Experts) architecture. “Optimization takes time, but the market’s enthusiasm has a narrow window, so providers go live first and optimize later rather than fully optimizing before launch.”
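For readers unfamiliar with the term, the toy sketch below shows what an MoE layer does and why it complicates serving: only a couple of experts fire for each token, so per-token compute stays modest, yet every expert’s weights must remain loaded, and the routing makes batching and load balancing across GPUs harder. The layer sizes and top-2 routing are illustrative, not DeepSeek’s actual configuration.

```python
# Toy Mixture-of-Experts layer (illustrative sizes, not DeepSeek's real config).
# Each token is routed to its top-2 experts: per-token compute stays small,
# but all experts' weights must stay resident in memory, and uneven routing
# makes it hard to keep every GPU busy when serving at scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)     # pick top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)        # 10 tokens
print(ToyMoE()(tokens).shape)       # torch.Size([10, 64])
```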
For R1 to run stably now, the key lies in capability related to inference-side reserves and optimization. What DeepSeek needs to do is find ways to reduce inference costs and decrease the number of tokens output per request.
At the same time, the persistent lag suggests DS’s actual compute reserves may not be as vast as SemiAnalysis claims. High-Flyer’s quant business consumes GPUs, and DeepSeek’s training team needs them too, leaving relatively few for serving end users. Given its current trajectory, DeepSeek probably has little incentive in the short term to spend money renting capacity just to improve the free-user experience; more likely it will wait until its consumer-facing business model takes shape before considering leased services, which means the lag will persist for quite some time.
“They likely need two moves: 1) Implement a paid mechanism to limit free user model usage; 2) Partner with cloud service providers to leverage others’ GPU resources,” said developer Chen Yunfei. His proposed temporary fix enjoys broad consensus in the industry.
Yet currently, DeepSeek appears unconcerned about its "server busy" issue. As a company pursuing AGI, DeepSeek seems unwilling to focus too much on this sudden influx of user traffic. For the foreseeable future, users may simply have to get used to seeing the "server busy" screen.