
Variant Li Jin: Overcoming the AI Data Wall, Data DAOs Are Timely
Data DAOs represent a potentially promising path toward generating new high-quality datasets and overcoming the AI data wall.
Author: Li Jin
Compiled by: TechFlow

OpenAI’s high-profile data licensing agreements with News Corp and Reddit underscore the growing importance of high-quality data in AI. Today’s frontier models are trained on massive internet datasets. For example, Common Crawl indexes about 10% of web pages for LLM training, encompassing over 100 trillion tokens.
One path to further improving AI models is expanding and enhancing the data they can be trained on. We’ve been exploring mechanisms for aggregating data, particularly in decentralized ways. We’re especially interested in how decentralized approaches could help generate novel datasets while economically rewarding contributors and creators.
Over the past few years, within the cryptocurrency space, there has been discussion around the concept of data DAOs—collectives of individuals who create, organize, and manage data. Multicoin and others have covered this topic, but the rapid advancement of AI provides a new “why now” catalyst for data DAOs.
Data in Today’s AI
Currently, AI models are trained on public data obtained either through licensing partnerships, such as the News Corp and Reddit deals, or by scraping the open internet. For instance, Meta's Llama 3 was trained on 15 trillion tokens from publicly available sources. These methods aggregate large volumes of data quickly, but they have limitations in both what data is collected and how it is collected.
First, the "what": AI progress is bottlenecked by the quality and quantity of data. Leopold Aschenbrenner wrote about the so-called "data wall" constraining further algorithmic improvements: "Soon, the naive approach of pretraining large language models on more and more scrapable data will begin hitting serious bottlenecks."
One way to break through the data wall is to open up datasets that are currently unavailable. For example, model companies cannot access logged-in user data without violating most websites’ terms of service, nor can they access data that hasn’t yet been aggregated. Vast amounts of private data also remain out of reach for AI training today, such as enterprise Google Drives, corporate Slacks, personal health records, and private messages.
Second, the "how": Under the existing arrangement, the companies that aggregate data capture most of the value. Reddit’s S-1 lists data licensing as a key anticipated revenue stream: "We expect our growing data advantage and intellectual property to continue being critical elements for future LLM training." Yet the end users who generate the actual content receive no economic benefit from these licensing deals or from the AI models themselves. This misalignment may dampen participation: we are already seeing lawsuits against generative AI firms and movements to opt out of training datasets. It also raises socioeconomic concerns about concentrating revenue in the hands of model companies and platforms rather than end users.
The Impact of Data DAOs
The data challenges above share a common thread: they benefit from large-scale contributions from diverse, representative groups of users. While any single data point may have negligible value for model performance, a large group of users can collectively pool their data into valuable new datasets for AI training. This is precisely where data DAOs come in. Through data DAOs, contributors can not only earn economic rewards but also govern how their data is used and monetized.
Data DAOs can address several gaps in today’s data landscape, including but not limited to the following areas:
Real-World Data
In the decentralized physical infrastructure (DePIN) space, networks like Hivemapper collect up-to-date global map data by incentivizing dashcam owners to contribute their footage, and applications reward users for submitting real-time information (e.g., road closures or construction). DePIN networks can be viewed as real-world data DAOs, where datasets are generated by hardware devices and/or user networks. These datasets have commercial value for a range of companies, and revenue is returned to contributors via token rewards.
Personal Health Data
Biohacking is a social movement in which individuals and communities conduct self-experimentation to study biology. For example, people might try different nootropics, test therapies or environmental changes to improve sleep, or even self-administer experimental drugs.
Data DAOs can bring structure and incentives to such biohacking activities by organizing participants in shared experiments and systematically collecting results. Revenue from research labs or pharmaceutical companies using personal health DAOs could be distributed back to contributing participants via token rewards.
Reinforcement Learning from Human Feedback (RLHF)
RLHF fine-tunes AI models using human feedback to improve their performance. Typically, the people giving feedback need to be domain experts who can evaluate model outputs effectively. For instance, a lab might recruit PhDs in mathematics to improve its LLM’s math capabilities. Token-based incentives can attract and motivate experts through speculative upside, while crypto payment rails make participation accessible globally. Companies like Sapien, Fraction, and Sahara are working in this space.
Private Data
As publicly available data for AI training becomes increasingly exhausted, competition may shift toward proprietary datasets, including private user data. Vast quantities of high-quality data still lie behind login walls, in direct messages, in private documents, and in similar silos. This data could be used to train personalized AIs, and it also contains valuable insights that are inaccessible on the public web.
However, accessing and utilizing this data presents major challenges, including legal and ethical considerations. Data DAOs could offer a solution by allowing willing participants to upload and monetize their data while governing its usage. For example, the Reddit Data DAO allows users to upload Reddit data exported from the platform—including comments, posts, and voting history—into a database that can be sold or rented to AI companies in a privacy-preserving manner. Token incentives allow users to earn not just from one-time transactions but ongoing returns based on the value created when AI models use their data.
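One way to picture the ongoing-returns idea is a pro-rata split, where each period's licensing revenue is divided in proportion to how often each contributor's data was used. The sketch below is a minimal illustration under that assumption, not a description of how the Reddit Data DAO or any other project actually pays out. The function name, the usage counts, and the revenue figure are all hypothetical.

```python
def split_revenue(revenue_cents, usage_by_contributor):
    """Split revenue (integer cents) pro rata by usage count.

    Returns (payouts, undistributed_cents). Amounts are floored so the
    total paid out never exceeds the revenue; any remainder is returned
    separately (for example, to be kept in a treasury).
    """
    total_usage = sum(usage_by_contributor.values())
    if total_usage == 0:
        # Nobody's data was used this period, so nothing is paid out.
        return {name: 0 for name in usage_by_contributor}, revenue_cents
    payouts = {
        name: revenue_cents * usage // total_usage
        for name, usage in usage_by_contributor.items()
    }
    return payouts, revenue_cents - sum(payouts.values())


# Hypothetical period: $100.00 of licensing revenue and three contributors.
usage = {"alice": 600, "bob": 300, "carol": 100}
payouts, leftover = split_revenue(10_000, usage)
print(payouts, leftover)
```

Working in integer cents avoids floating-point rounding disputes over who is owed what. A real system would still need to decide what "usage" means and how it is measured, which is the harder problem.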
Open Questions and Challenges
While the potential benefits of data DAOs are significant, several considerations and challenges remain.
Distorting Effects of Incentives
From the history of token incentives in crypto, we know that external rewards can alter user behavior. This directly affects how token incentives are used for data collection: incentives may skew participant demographics and the types of data contributed.
Introducing token rewards may also encourage participants to maximize gains by submitting low-quality or fabricated data. This is particularly concerning because the revenue potential of these data DAOs depends on data quality. If contributed data is distorted, the value of the dataset diminishes.
Data Measurement and Reward
At the heart of data DAOs is the idea of rewarding contributors with tokens, with long-term payouts that converge toward the DAO’s revenue. However, rewarding different data contributions accurately is difficult because the value of data is subjective. In the biohacking case, for example: is some users’ data more valuable than others’, and if so, what determines that? For mapping data: is map information from certain geographic regions more valuable than from others, and how can such differences be quantified? Research is underway into measuring the value of data by calculating its incremental contribution to model performance, but these methods can be computationally intensive.
Moreover, establishing robust mechanisms to verify data authenticity and accuracy is crucial. Without such safeguards, systems may be vulnerable to fraudulent submissions (e.g., fake accounts) or Sybil attacks. DePIN networks attempt to solve this by integrating at the hardware level, but other user-driven data DAOs may be more susceptible to manipulation.
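To make the "incremental contribution to model performance" idea concrete, here is a minimal sketch of leave-one-out valuation: retrain the model without each contributor and see how much test accuracy changes. Everything in it is an illustrative assumption, including the toy nearest-centroid classifier, the contributor names, and the data. It is not the method used by any particular data DAO or research group.

```python
from collections import Counter


def train_centroids(rows):
    """Fit a toy nearest-centroid model: label -> mean of its feature values."""
    sums, counts = Counter(), Counter()
    for x, label in rows:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}


def accuracy(centroids, test_rows):
    """Fraction of test rows whose nearest centroid carries the right label."""
    if not centroids:
        return 0.0
    hits = 0
    for x, label in test_rows:
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        hits += predicted == label
    return hits / len(test_rows)


def leave_one_out_values(contributions, test_rows):
    """Value of a contributor = accuracy with everyone - accuracy without them."""
    all_rows = [row for rows in contributions.values() for row in rows]
    baseline = accuracy(train_centroids(all_rows), test_rows)
    values = {}
    for name in contributions:
        rest = [
            row
            for other, rows in contributions.items()
            if other != name
            for row in rows
        ]
        # One full retraining per contributor: this is the expensive part.
        values[name] = baseline - accuracy(train_centroids(rest), test_rows)
    return values


# Hypothetical contributors; "mallory" submits a mislabeled point.
contributions = {
    "alice": [(1.0, "low"), (2.0, "low")],
    "bob": [(9.0, "high"), (10.0, "high")],
    "mallory": [(9.0, "low")],
}
test_rows = [(1.5, "low"), (3.0, "low"), (6.5, "high"), (9.0, "high")]
values = leave_one_out_values(contributions, test_rows)
print(values)
```

In this toy data the mislabeled contribution receives a negative value, because the model does better without it. The cost is one retraining per contributor, which is cheap here but quickly becomes prohibitive for large models. That is the computational burden the text refers to.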
Incrementality of New Data
Much of the public web has already been used for training, so data DAO operators must consider whether datasets collected through distributed efforts are truly incremental and add value beyond existing public web data. They must also assess whether researchers could simply license the same data from the platform or obtain it by other means. These considerations highlight the importance of gathering data that goes beyond what already exists, which leads directly to the next issue: scale of impact and revenue potential.
Scale of Revenue Opportunities
At their core, data DAOs are building two-sided markets connecting data buyers with data contributors. The success of a data DAO hinges on attracting a stable and diverse customer base willing to pay for data.
Data DAOs need to identify and validate their ultimate source of demand, ensuring that revenue opportunities are large enough, both in total and per contributor, to incentivize the necessary volume and quality of data. For example, the idea of creating a user data DAO to pool personal preferences and browsing data for advertising has been discussed for years, but the income such a network could pass back to users may ultimately be trivial. (For comparison, Meta’s global ARPU was $13.12 at the end of 2023.) With AI companies planning to spend trillions of dollars on training, revenue from user data might now be large enough to drive mass participation, creating a compelling “why now” moment for data DAOs.
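The per-contributor question comes down to simple arithmetic, sketched below. The only figure taken from the text is Meta's $13.12 ARPU. The revenue total, contributor count, payout share, and function names are hypothetical values chosen purely for illustration.

```python
META_ARPU_USD = 13.12  # comparison figure quoted in the text (end of 2023)


def revenue_per_contributor(total_revenue_usd, contributors, payout_share):
    """Average payout per contributor, given the share of revenue passed through."""
    if contributors <= 0:
        raise ValueError("contributors must be positive")
    if not 0 <= payout_share <= 1:
        raise ValueError("payout_share must be between 0 and 1")
    return total_revenue_usd * payout_share / contributors


def contributors_supported(total_revenue_usd, payout_share, target_usd):
    """Largest contributor count that still pays each person at least target_usd."""
    return int(total_revenue_usd * payout_share // target_usd)


# Hypothetical DAO: $5M of data revenue, half passed through, 100k contributors.
average = revenue_per_contributor(5_000_000, 100_000, 0.5)
ceiling = contributors_supported(5_000_000, 0.5, META_ARPU_USD)
print(f"average payout: ${average:.2f}")
print(f"contributors supportable at Meta's ARPU: {ceiling}")
```

The same revenue pool looks attractive or trivial depending on how many contributors share it, which is why both the total and the per-contributor figures need to be validated.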
Overcoming the Data Wall
Data DAOs represent a potentially promising path toward generating new high-quality datasets and overcoming the AI data wall. Exactly how this will unfold remains to be seen, but we’re excited to watch developments in this space.