
Foresight Ventures: Analysis and Reflections on Decentralized Computing Networks
TechFlow Selected

Under the trend of AI large models, computing power resources will be the major battleground of the next decade and the most important element for future human society.
Author: Yihan@Foresight Ventures
Abstract
- Currently, there are two major directions for combining AI and crypto: decentralized computing power and ZKML. For ZKML, please refer to one of my previous articles. This article focuses on analyzing and reflecting on decentralized distributed computing networks.
- Under the trend of large AI model development, computing power resources will be the next decade's biggest battleground and the most important asset for future human society, extending beyond commercial competition into strategic resources in great-power rivalry. Investment in high-performance computing infrastructure and computing reserves will grow exponentially.
- Decentralized distributed computing networks face the highest demand but also the greatest challenges and technical bottlenecks in training large AI models, including complex data synchronization and network optimization issues. Data privacy and security remain critical constraints. While some existing technologies offer preliminary solutions, they remain impractical for large-scale distributed training due to enormous computational and communication overheads.
- Decentralized distributed computing networks have greater potential for model inference applications, with substantial projected growth. However, they still face challenges such as communication latency, data privacy, and model security. Compared to model training, inference involves lower computational complexity and less data interaction, making it more suitable for distributed environments.
- Through case studies of the startups Together and Gensyn.ai, this article illustrates the overall research direction and specific approaches of decentralized distributed computing networks from the perspectives of technical optimization and incentive layer design.
Distributed Computing — Large Model Training
When discussing distributed computing applications in training, we usually focus on large language models (LLMs), primarily because small models do not require significant computing power. Addressing data privacy and engineering complexities for decentralization is not cost-effective compared to centralized solutions. In contrast, LLMs demand immense computing power, especially during their explosive early stages. From 2012 to 2018, AI computational needs doubled approximately every four months, and today's demand remains highly concentrated. We can anticipate continued strong incremental demand over the next 5–8 years.
While huge opportunities exist, we must clearly recognize the challenges. Everyone knows the market is large, but where exactly are the pain points? The key to identifying outstanding projects lies in whether they target these core problems instead of blindly entering the space.

(NVIDIA NeMo Megatron Framework)
1. Overall Training Process
Take training a 175-billion-parameter model as an example. Due to its massive size, parallel training across many GPUs is required. Assume a centralized data center has 100 GPUs, each with 32 GB of memory.
- Data Preparation: A massive dataset containing internet content, news, books, etc., is needed. Before training, this data must be preprocessed: text cleaning, tokenization, vocabulary building, and so on.
- Data Partitioning: The processed data is split into multiple batches for parallel processing across GPUs. Assuming a batch size of 512 (each batch contains 512 text sequences), the entire dataset is divided into a queue of batches.
- Inter-device Data Transfer: At the start of each training step, the CPU fetches a batch from the queue and sends it to the GPU via the PCIe bus. Assuming an average sequence length of 1024 tokens and 4 bytes per token (single-precision float), each batch is about 512 × 1024 × 4 B = 2 MB. This transfer typically takes just milliseconds.
- Parallel Training: Each GPU performs forward and backward passes upon receiving data, calculating gradients for the parameters. Since a single GPU cannot hold all parameters due to memory limitations, model parallelism is used to distribute parameters across multiple GPUs.
- Gradient Aggregation and Parameter Update: After backpropagation, each GPU holds partial gradients. These gradients must be aggregated across all GPUs to compute global gradients. This requires network transmission; assuming a 25 Gbps network, transferring 700 GB (175 billion parameters at 4 bytes each) takes ~224 seconds. Each GPU then updates its parameters using the global gradient.
- Synchronization: After parameter updates, all GPUs must synchronize to ensure consistent model parameters for the next training step. This also requires network data transfer.
- Repeat Training Steps: The above steps repeat until all batches are processed or a predetermined number of epochs is reached.
This process involves extensive data transfer and synchronization, which may become a bottleneck for training efficiency. Therefore, optimizing bandwidth and latency, along with efficient parallel and synchronization strategies, is crucial for large-scale model training.
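The loop described above can be simulated in a few lines. The sketch below is a toy single-process model of data-parallel training: each simulated "GPU" computes local gradients on its own mini-batch, the gradients are averaged (standing in for the all-reduce step), and every replica applies the identical update. The model (a noise-free linear regression) and all names are illustrative, not any specific framework's API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus, dim, lr = 4, 8, 0.1
w = np.zeros(dim)                      # replicated model parameters
true_w = rng.normal(size=dim)          # target of a toy linear regression

for step in range(200):
    local_grads = []
    for g in range(n_gpus):            # each "GPU" gets its own mini-batch
        X = rng.normal(size=(16, dim))
        y = X @ true_w
        grad = 2 * X.T @ (X @ w - y) / len(y)   # backward pass (analytic)
        local_grads.append(grad)
    # gradient aggregation: all-reduce = average across replicas
    global_grad = np.mean(local_grads, axis=0)
    w -= lr * global_grad              # identical update on every replica

print(np.allclose(w, true_w, atol=1e-3))
```

Note that every step requires exchanging a full gradient vector between replicas; in a real cluster that exchange, not the arithmetic, is the expensive part.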
2. Communication Overhead Bottleneck
Note that communication bottlenecks are precisely why current decentralized computing networks cannot support large language model training.
Nodes must frequently exchange information to coordinate work, resulting in communication overhead. This problem is particularly severe for LLMs due to their vast number of parameters. Communication overhead manifests in several aspects:
- Data Transmission: Nodes frequently exchange model parameters and gradients during training, requiring massive data transfers that consume significant bandwidth. Poor network conditions or large geographical distances between nodes increase transmission delay, further exacerbating communication overhead.
- Synchronization Issues: Nodes must coordinate to ensure correct training progress, requiring frequent synchronization operations such as updating model parameters and computing global gradients. These involve heavy data transmission and waiting for all nodes to complete, causing significant communication overhead and idle time.
- Gradient Accumulation and Update: Each node computes local gradients and transmits them for aggregation and update. This requires extensive gradient data transmission and synchronization across nodes, contributing heavily to communication overhead.
- Data Consistency: Model parameters across nodes must remain consistent, necessitating frequent data validation and synchronization and leading to additional communication overhead.
Although methods like parameter/gradient compression and advanced parallel strategies can reduce communication overhead, they may introduce extra computation burdens or negatively impact training performance. Moreover, these techniques cannot fully resolve communication issues, especially under poor network conditions or large inter-node distances.
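To make the compression idea concrete, here is a hedged sketch of the simplest variant: uniformly quantizing a float32 gradient tensor to int8 before transmission and dequantizing on receipt. This is generic uniform quantization for illustration, not any specific paper's scheme; production systems typically add error feedback to preserve convergence, which is exactly the "extra computation burden" mentioned above.

```python
import numpy as np

def quantize(grad: np.ndarray):
    """Map float32 gradients onto int8 levels with a shared scale."""
    scale = max(float(np.abs(grad).max()) / 127, 1e-12)
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

grad = np.random.default_rng(1).normal(size=10_000).astype(np.float32)
q, scale = quantize(grad)
restored = dequantize(q, scale)

print(q.nbytes, grad.nbytes)   # int8 payload is 4x smaller on the wire
```

The trade-off is visible directly: bytes transmitted drop 4x, but each value carries a quantization error of up to half the scale.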
Example:
Decentralized Distributed Computing Network
GPT-3 has 175 billion parameters. Represented as single-precision floats (4 bytes each), storing these parameters requires ~700 GB of memory. During distributed training, these parameters must be frequently transmitted and updated across computing nodes.
Assume 100 computing nodes, each needing to update all parameters per step. Each step would then require transmitting ~70 TB (700 GB × 100) of data. With an optimistic assumption of 1 second per step, this implies a sustained data rate of 70 TB/s—far exceeding most networks and rendering it practically infeasible.
In reality, communication delays and network congestion mean data transfer times could far exceed 1 second. Computing nodes may spend more time waiting than computing, drastically reducing training efficiency—not just slower, but fundamentally unworkable.
Centralized Data Center
Even in centralized data centers, large model training requires intensive communication optimization.
In centralized environments, high-performance devices form clusters connected via high-speed networks. Even so, communication overhead remains a bottleneck due to frequent transmission and updates of model parameters and gradients.
As previously mentioned, assume 100 nodes each with 25 Gbps bandwidth. Updating all parameters per step (~700 GB) would take ~224 seconds. Leveraging advantages of centralized facilities, developers can optimize internal network topology and use model parallelism to significantly reduce this time.
By comparison, in a distributed environment with 100 globally dispersed nodes averaging only 1 Gbps bandwidth, transmitting the same 700 GB would take ~5600 seconds—much longer than in a centralized setup. Real-world delays due to latency and congestion would make it even worse.
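The back-of-envelope numbers above follow from one line of arithmetic (using GB = 10^9 bytes and Gbps = 10^9 bits/s, and ignoring latency and congestion):

```python
def transfer_seconds(gigabytes: float, gbps: float) -> float:
    """Ideal time to move a payload over a link: bytes -> bits / line rate."""
    return gigabytes * 8 / gbps

print(transfer_seconds(700, 25))   # 224.0 s on a 25 Gbps data-center link
print(transfer_seconds(700, 1))    # 5600.0 s on a 1 Gbps consumer link
```

The 25x gap in bandwidth translates directly into a 25x gap in per-step communication time, before any real-world delays are counted.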
Compared to decentralized computing networks, optimizing communication in centralized data centers is relatively easier. Centralized equipment connects through high-speed networks with superior bandwidth and low latency. In contrast, decentralized nodes suffer from variable and often inferior network conditions, worsening communication overhead.
OpenAI used the Megatron model-parallel framework to address communication overhead when training GPT-3. Megatron partitions model parameters across multiple GPUs, with each device handling only a subset, thereby reducing per-device workload and communication costs. High-speed interconnects and optimized network topologies were also employed to shorten communication paths.
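The core partitioning idea behind Megatron-style model parallelism can be illustrated in miniature: split a weight matrix column-wise across "devices", compute the partial outputs locally, then concatenate. The numbers and names below are illustrative only; real implementations also shard other layers and overlap the gather with computation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 64))           # activations (batch, hidden)
W = rng.normal(size=(64, 256))         # full weight matrix

n_devices = 4
shards = np.split(W, n_devices, axis=1)        # each device holds 1/4 of W
partial = [x @ shard for shard in shards]      # purely local compute
y_parallel = np.concatenate(partial, axis=1)   # gather the outputs

print(np.allclose(y_parallel, x @ W))          # matches the full matmul
```

Each device stores and multiplies only a quarter of the parameters; the price is the gather step, which is why this scheme rewards high-bandwidth, low-latency interconnects.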

(Data used to train LLM models)
3. Why Decentralized Computing Networks Cannot Achieve These Optimizations
It's technically possible, but these optimizations are far less effective compared to centralized data centers.
1. Network Topology Optimization: In centralized data centers, direct control over hardware and layout allows custom network topology design and optimization. In decentralized environments, nodes span different geographies—one in China, another in the U.S.—making direct control over their connections impossible. Software-based routing optimization exists but is less effective than physical network tuning. Geographic dispersion also introduces unpredictable variations in latency and bandwidth, further limiting optimization effectiveness.
2. Model Parallelism: Model parallelism splits model parameters across nodes to accelerate training. However, it demands frequent inter-node data transfers, requiring high bandwidth and low latency. It works well in centralized data centers due to excellent network conditions, but suffers severely in decentralized settings with poor connectivity.
4. Challenges in Data Security and Privacy
Nearly every stage involving data processing and transmission risks compromising data security and privacy:
1. Data Distribution: Training data must be assigned to participating nodes. Malicious nodes may misuse or leak this data.
2. Model Training: Nodes perform computations on assigned data and output parameter updates or gradients. If these processes are intercepted or results maliciously analyzed, sensitive data could be exposed.
3. Parameter and Gradient Aggregation: Outputs from nodes must be aggregated to update the global model. Communications during aggregation may leak information about the training data.
What solutions exist for data privacy?
- Secure Multi-Party Computation (SMC): Already successful in small-scale, specific tasks. However, due to high computational and communication overhead, SMC is not yet widely applicable to large-scale distributed training.
- Differential Privacy (DP): Used in data collection and analytics (e.g., Chrome user statistics). But in large-scale deep learning, DP harms model accuracy, and designing proper noise generation and injection mechanisms is challenging.
- Federated Learning (FL): Applied in edge-device training (e.g., Android keyboard prediction). However, FL faces scalability issues in large distributed tasks due to high communication overhead and coordination complexity.
- Homomorphic Encryption: Successfully used in low-complexity tasks, but due to high computational cost it remains impractical for large-scale distributed training.
In summary
Each method has its niche and limitations; none currently offers a complete solution to data privacy in decentralized large-model training.
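As one concrete illustration of the accuracy trade-off above, here is a minimal Gaussian-mechanism sketch in the spirit of DP-SGD: clip each per-node gradient to a norm bound, then add calibrated noise before aggregation. The constants are illustrative and not tuned for any real privacy budget.

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_aggregate(grads, clip_norm=1.0, noise_mult=0.5):
    """Clip each gradient's L2 norm, sum, add Gaussian noise, average."""
    clipped = [g * min(1.0, clip_norm / np.linalg.norm(g)) for g in grads]
    noise = rng.normal(scale=noise_mult * clip_norm, size=grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(grads)

grads = [rng.normal(size=100) for _ in range(32)]
agg = dp_aggregate(grads)
print(agg.shape)
```

Clipping bounds any single node's influence on the aggregate, and the noise masks individual contributions; both distort the true gradient, which is the source of the accuracy loss noted above.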
Can ZK solve data privacy issues in large model training?
Theoretically, zero-knowledge proofs (ZKPs) can protect data privacy in distributed computing by allowing a node to prove it performed computation correctly without revealing inputs or outputs.
However, applying ZKPs to large-scale decentralized training faces major bottlenecks:
- High Computational and Communication Overhead: Constructing and verifying ZKPs demands substantial computing resources, and the proofs themselves are large, increasing communication costs. In large model training these overheads become prohibitive; for instance, generating a proof for every mini-batch would drastically increase training time and cost.
- Protocol Complexity: Designing and implementing a ZKP protocol suitable for large model training is extremely complex. It must handle massive datasets and intricate computations while managing potential errors.
- Hardware and Software Compatibility: ZKPs require specialized hardware and software support, which may not be available across all decentralized computing devices.
In summary
Applying ZKPs to large-scale decentralized model training requires years of further research and development, plus increased academic focus and resource allocation.
Distributed Computing — Model Inference
Another major application of distributed computing lies in model inference. Based on our understanding of the evolution of large models, training demand will eventually plateau as models mature, while inference demand will grow exponentially alongside advancements in large models and AIGC.
Inference tasks generally have lower computational complexity and weaker data interdependence, making them better suited for distributed environments.

(Power LLM inference with NVIDIA Triton)
1. Challenges
Communication Latency:
Inter-node communication is essential in distributed environments. In decentralized networks, nodes may be globally distributed, introducing latency—especially problematic for real-time inference tasks.
Model Deployment and Updates:
Models must be deployed across nodes. When models are updated, each node must upgrade, consuming considerable bandwidth and time.
Data Privacy:
Though inference mainly requires input data and the model, with minimal intermediate data returned, inputs may still contain sensitive information like personal user data.
Model Security:
In decentralized networks, models run on untrusted nodes, risking intellectual property theft and misuse. Security and privacy risks arise if attackers analyze model behavior to infer sensitive information, especially when handling confidential data.
Quality Control:
Nodes in a decentralized network vary in computing capacity and resources, potentially affecting inference performance and result consistency.
2. Feasibility
Computational Complexity:
During training, models undergo repeated iterations involving forward and backward propagation, activation functions, loss calculation, gradient computation, and weight updates—resulting in high computational complexity.
In contrast, inference requires only a single forward pass to generate predictions. For example, in GPT-3, input text is converted into vectors and passed through Transformer layers to produce a probability distribution for the next word. In GANs, a noise vector generates an image. These involve only forward propagation—no gradient computation or parameter updates—making them computationally lighter.
Data Interactivity:
Inference typically handles individual inputs rather than large batches. Each output depends solely on the current input, eliminating the need for extensive data interaction and reducing communication pressure.
For generative image models like GANs, inputting a noise vector yields one image. Outputs are independent, requiring no cross-input coordination.
Similarly, in GPT-3, predicting the next word depends only on current context and model state, with no dependency on other inputs or outputs—minimal interactivity required.
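The forward-only nature of inference can be sketched directly: a single pass turns a context vector into a next-token probability distribution, with no gradients or parameter updates. The toy one-layer "model" below stands in for a full Transformer stack; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
vocab, hidden = 50, 16
W_out = rng.normal(size=(hidden, vocab))   # frozen output projection

def predict_next(context_vec: np.ndarray) -> np.ndarray:
    """One forward pass: logits, then a numerically stable softmax."""
    logits = context_vec @ W_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = predict_next(rng.normal(size=hidden))
print(probs.shape)
```

Each call depends only on its own input, so independent requests can be routed to independent nodes without cross-node gradient exchange, which is exactly why inference tolerates distributed environments far better than training.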
In Summary
Whether for large language models or generative image models, inference tasks have lower computational complexity and weaker data interactivity, making them better suited for decentralized distributed computing networks. This explains why most current projects focus on this direction.
Projects
Decentralized distributed computing networks have extremely high technical barriers and broad requirements, including hardware support, which is why few attempts exist today. Consider Together and Gensyn.ai as examples:
1. Together

(RedPajama from Together)
Together is a company focused on open-sourcing large AI models and developing decentralized AI computing solutions, aiming to make AI accessible to anyone, anywhere. Together recently raised a $20M seed round led by Lux Capital.
Founded by Chris, Percy, and Ce, Together was born from the observation that large model training requires expensive GPU clusters, concentrating resources and capabilities within a few large corporations.
From my perspective, a reasonable startup roadmap for decentralized computing might look like this:
Step 1. Open-Source Models
To enable model inference on a decentralized computing network, nodes must access models at low cost—ideally via open-source licenses. Proprietary models (e.g., ChatGPT) are unsuitable due to licensing complexity and cost.
Thus, a hidden barrier for decentralized computing providers is strong large-model development and maintenance capability. Building and open-sourcing a robust base model reduces reliance on third-party models, solves foundational deployment issues, and demonstrates the network’s ability to efficiently train and run large models.
Together follows this path. RedPajama, based on LLaMA and co-developed with Ontocord.ai, ETH DS3 Lab, Stanford CRFM, and Hazy Research, aims to build a series of fully open-source large language models.
Step 2. Deploy Distributed Computing for Model Inference
As discussed earlier, inference has lower computational and interactive demands, making it ideal for decentralized environments.
Building on open-source models, Together’s team optimized the RedPajama-INCITE-3B model with techniques like LoRA for low-cost fine-tuning, enabling smoother CPU execution—even on MacBook Pros with M2 Pro chips. Despite its smaller scale, the model outperforms peers and has been applied in legal and social contexts.
Step 3. Extend to Model Training on Distributed Computing

(Computing network diagram from "Overcoming Communication Bottlenecks for Decentralized Training")
Long-term, despite significant challenges, supporting large model training remains the most attractive goal. From inception, Together has researched overcoming communication bottlenecks in decentralized training, publishing a paper at NeurIPS 2022: "Overcoming Communication Bottlenecks for Decentralized Training." Key directions include:
Scheduling Optimization
In decentralized environments, node connections vary in latency and bandwidth. Assigning communication-heavy tasks to faster-connected devices is crucial. Together models scheduling cost to optimize strategies, minimizing communication overhead and maximizing throughput. Their research shows that even with 100x slower networks, end-to-end training throughput drops only 1.7–2.3x, suggesting scheduling can narrow the gap between decentralized and centralized systems.
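The intuition behind such scheduling can be shown with a toy cost model (my own construction for illustration, not Together's actual formulation): if the step time is gated by the slowest link, placements that route heavy traffic over fast links win. Here three nodes form a pipeline, one link is 10x slower than the others, and we search placements exhaustively.

```python
import itertools

# Directed link bandwidths in Gbps; the B<->C link is the slow one.
bandwidth_gbps = {("A", "B"): 10, ("B", "A"): 10,
                  ("B", "C"): 1,  ("C", "B"): 1,
                  ("A", "C"): 10, ("C", "A"): 10}
volume_gb = [8, 1]   # GB exchanged between stages 0-1 and 1-2 per step

def step_time(order):
    """Per-step comm time = slowest (volume / bandwidth) along the pipeline."""
    links = list(zip(order, order[1:]))
    return max(v * 8 / bandwidth_gbps[l] for v, l in zip(volume_gb, links))

best = min(itertools.permutations("ABC"), key=step_time)
print(best, step_time(best))
```

The optimal placement puts the 8 GB exchange on a 10 Gbps link and the 1 GB exchange on another fast link, avoiding the slow B-C link entirely; a naive placement that routes the heavy traffic over the 1 Gbps link is 10x worse. Real schedulers solve this over thousands of heterogeneous nodes, which is where the modeling effort goes.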
Communication Compression Optimization
Together proposed compressing forward activations and backward gradients, introducing the AQ-SGD algorithm, which guarantees convergence for stochastic gradient descent. AQ-SGD enables fine-tuning large foundation models on slow networks (e.g., 500 Mbps), achieving only 31% slower performance than uncompressed centralized training (e.g., 10 Gbps). Combined with state-of-the-art gradient compression (e.g., QuantizedAdam), AQ-SGD achieves a 10% end-to-end speedup.
Project Summary
Together boasts a well-rounded team with strong academic backgrounds spanning large models, cloud computing, and hardware optimization. Their strategic roadmap, from open-sourcing large models and testing idle computing (e.g., Macs) for inference to tackling decentralized training, reflects long-term patience and deliberate preparation.
However, there is limited public evidence of progress on incentive layer design, which I believe is equally important and critical for ensuring the sustainable development of decentralized computing networks.
2. Gensyn.ai

(Gensyn.ai)
From Together’s technical approach, we gain insight into how decentralized computing networks can deploy model training and inference and what R&D priorities matter.
Another critical aspect is the design of the incentive layer / consensus algorithm. An effective network should:
1. Offer sufficiently attractive rewards;
2. Ensure fair compensation and prevent cheating (“pay-for-work”);
3. Efficiently schedule and assign tasks to avoid idle or overloaded nodes;
4. Keep the incentive mechanism simple and efficient, avoiding excessive system load or latency.
How does Gensyn.ai approach this?
- Becoming a Node: First, solvers in the computing network bid for the right to process user-submitted tasks. Depending on task size and fraud risk, solvers must stake a certain amount.
- Verification: While updating parameters, a solver generates multiple checkpoints (for transparency and traceability) and periodically produces cryptographic inference proofs (verifiable progress reports). Once the solver completes part of the computation, the protocol selects a verifier (who also stakes funds to ensure honest verification) to check portions of the result based on the provided proofs.
- If Solver and Verifier Disagree: Using a Merkle tree structure, the protocol pinpoints the exact location of the discrepancy. All verification actions are recorded on-chain, and fraudulent parties lose their staked funds.
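The bisection idea can be sketched as follows: hash the sequence of computation checkpoints into a Merkle tree, then walk from the root downward comparing subtree hashes to find the first checkpoint where solver and verifier diverge, without comparing every result. This is a simplified, hypothetical scheme for illustration, not Gensyn.ai's actual protocol.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def tree_hash(leaves):
    """Merkle root of a list of byte-string checkpoints."""
    if len(leaves) == 1:
        return h(leaves[0])
    mid = len(leaves) // 2
    return h(tree_hash(leaves[:mid]) + tree_hash(leaves[mid:]))

def find_divergence(a, b):
    """Index of the first differing checkpoint, or None if all match."""
    if tree_hash(a) == tree_hash(b):
        return None
    def walk(xa, xb, offset):
        if len(xa) == 1:
            return offset
        mid = len(xa) // 2
        if tree_hash(xa[:mid]) != tree_hash(xb[:mid]):
            return walk(xa[:mid], xb[:mid], offset)
        return walk(xa[mid:], xb[mid:], offset + mid)
    return walk(a, b, 0)

solver = [f"ckpt{i}".encode() for i in range(8)]
verifier = list(solver)
verifier[5] = b"tampered"
print(find_divergence(solver, verifier))   # -> 5
```

Only O(log n) subtree comparisons are needed to localize the fault among n checkpoints, which is what makes on-chain dispute resolution tractable for large computations.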
Project Summary
The incentive and verification design allows Gensyn.ai to avoid re-executing entire computations during verification. Instead, only selected portions are verified using provided proofs—greatly improving efficiency. Nodes store only partial results, reducing storage and computational burden. Furthermore, potential cheaters cannot predict which parts will be checked, lowering fraud incentives.
This method of identifying discrepancies without comparing full results (by traversing from the Merkle root downward) efficiently locates computation errors—especially valuable for large-scale tasks.
Overall, Gensyn.ai’s incentive/verification layer aims for simplicity and efficiency. However, it remains theoretical for now, facing implementation challenges:
- Economically, setting staking parameters that effectively deter fraud without creating high entry barriers remains difficult.
- Technically, designing effective periodic cryptographic inference proofs requires advanced cryptography expertise.
- Task assignment needs intelligent scheduling algorithms. Purely bid-based allocation raises concerns: powerful nodes capable of large tasks may not bid (an availability incentive issue), while weak nodes that bid high may be unfit for complex jobs.
Final Thoughts on the Future
The fundamental question—"Who actually needs decentralized computing networks?"—remains unproven. Applying idle computing to large model training seems most logical and promising. Yet communication and privacy bottlenecks force us to reconsider:
Is there real hope for decentralized large model training?
If we step outside the consensus view of the “most rational use case,” could decentralized computing for small AI model training represent a significant opportunity? Technically, current limitations tied to model scale and architecture diminish. Market-wise, while large model training dominates attention, is the small model market truly unattractive?
I think not. Small AI models are easier to deploy and manage, more efficient in speed and memory usage. In many scenarios, users or companies don’t need general reasoning from large language models but only targeted predictions. Thus, in most practical cases, small AI models remain the more viable choice—deserving attention beyond the FOMO-driven large model frenzy.