Flower AI and Vana are building advanced AI models without data centers
A new crowdsourced training approach for developing large language models (LLMs) over the internet could shake up the AI industry later this year with a massive 100-billion-parameter model.
Researchers have trained a novel large language model (LLM) using GPUs distributed globally and combining private and public data, signaling a potential disruption to mainstream AI development. Two startups, Flower AI and Vana, are jointly behind this unconventional effort, naming their new model Collective-1.
Flower developed technology that distributes training across hundreds of computers connected via the internet. The company's tools have already been used by other organizations to train AI models without centralizing computing resources or data. Vana contributed data sources, including private messages from X, Reddit, and Telegram.
By modern standards, Collective-1 is relatively small, with 7 billion parameters (the adjustable values that encode what the model has learned and give it its capabilities), compared with today’s leading models such as ChatGPT, Claude, and Gemini, which number in the tens or hundreds of billions. Nic Lane, a computer scientist at the University of Cambridge and co-founder of Flower AI, said the distributed approach promises to go far beyond the scale of Collective-1. Lane added that Flower AI is currently training a 30-billion-parameter model on conventional data and plans to train a 100-billion-parameter model later this year, approaching the scale of industry leaders. “This could really change how people think about AI, so we’re working very hard on it,” Lane said. He also said the startup is incorporating images and audio into training to create multimodal models.
Distributed model building could also shift the power dynamics shaping the AI industry. Currently, AI companies build models by combining vast training datasets with powerful computing capacity concentrated in data centers equipped with advanced GPUs and linked by ultra-high-speed fiber-optic cables. They also heavily rely on datasets created by scraping publicly accessible (though sometimes copyrighted) materials, including websites and books.
This means only the wealthiest companies and nations with access to large numbers of powerful chips can develop the most capable and valuable models. Even open-source models like Meta’s Llama and DeepSeek’s R1 were built by companies with large data centers. Distributed methods could enable smaller companies and universities to build advanced AI by pooling diverse resources. Alternatively, they might allow countries lacking traditional infrastructure to network multiple data centers to build more powerful models.
Lane believes the AI industry will increasingly seek new ways to scale training beyond the limits of individual data centers. “Distributed approaches let you scale compute in a much more elegant way than data-center models,” he said.
Helen Toner, an AI governance expert at the Center for Security and Emerging Technology, said Flower AI’s approach is “interesting and potentially highly relevant” to AI competition and governance. “It may continue to struggle at the cutting edge, but could be an interesting fast-follower approach,” Toner said.
Divide and Conquer
Distributed AI training involves rethinking how computation for building powerful AI systems is partitioned. Creating an LLM involves feeding large amounts of text into a model, which adjusts its parameters to generate useful responses to prompts. Inside data centers, the training process is divided so parts can run on different GPUs and are periodically merged back into a master model.
The new method allows work typically done inside large data centers to instead occur on hardware possibly miles apart and connected via relatively slow or unstable internet connections.
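To make the split-then-merge idea concrete, the sketch below simulates it in Python with a toy linear model: several workers each train on their own data shard and only occasionally average their parameters into a master copy, so the expensive synchronization step happens rarely and can tolerate a slow link. It is a generic illustration of periodic parameter averaging, not code from Flower AI, Photon, or DiPaCo.

```python
# Minimal sketch of distributed training with periodic parameter averaging.
# Each "worker" trains a local copy of a toy linear model on its own data shard;
# every few local steps, the copies are averaged into a master model and
# broadcast back, which is the only step that needs the (slow) network.
import numpy as np

rng = np.random.default_rng(0)

true_w = np.array([2.0, -3.0, 0.5])   # ground-truth weights for the toy problem

def make_shard(n):
    """One worker's private slice of the training data."""
    X = rng.normal(size=(n, 3))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

shards = [make_shard(200) for _ in range(4)]   # 4 simulated workers
workers = [np.zeros(3) for _ in shards]        # each starts with its own model copy

LOCAL_STEPS = 20   # many cheap local updates between merges
ROUNDS = 10        # few expensive synchronizations over the "slow" link
LR = 0.05

for _ in range(ROUNDS):
    # Each worker trains independently on its shard (in parallel in practice).
    for i, (X, y) in enumerate(shards):
        w = workers[i]
        for _ in range(LOCAL_STEPS):
            grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
            w = w - LR * grad
        workers[i] = w
    # Periodic merge: average the local models into a master model,
    # then send that master back to every worker.
    master = np.mean(workers, axis=0)
    workers = [master.copy() for _ in workers]

print("recovered weights:", np.round(master, 3))   # close to [2.0, -3.0, 0.5]
```

In a real system the workers would be GPUs in different locations, the model would have billions of parameters, and the merge step would be far more sophisticated, but the basic pattern of long stretches of local work punctuated by rare synchronization is what lets training run over ordinary internet connections.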
Some large companies are also exploring distributed learning. Last year, Google researchers demonstrated a new scheme for partitioning and integrating computation called DIstributed PAth COmposition (DiPaCo), making distributed learning more efficient.
To build Collective-1 and other LLMs, Lane and academic collaborators from the UK and China developed a new tool called Photon to make distributed training more efficient. Lane said Photon improves on Google’s approach in how it represents data and how it shares and merges training work. The process is slower than conventional training but more flexible, allowing new hardware to be added mid-run to speed up training.
Photon was developed in collaboration with researchers from Beijing University of Posts and Telecommunications and Zhejiang University. The team released the tool last month under an open-source license, allowing anyone to use the method.
Flower AI partnered with Vana in building Collective-1. Vana is developing new methods for users to share personal data with AI developers. Vana’s software allows users to contribute private data from platforms like X and Reddit for training large language models and potentially specify permitted end uses—or even profit from their contributions.
Vana co-founder Anna Kazlauskas said the idea is to make previously untapped data available for AI training while giving users greater control over how their information is used in AI. “This data is typically not included in AI models because it’s not publicly available,” Kazlauskas said. “This is the first time user-contributed data has directly trained foundation models, and users retain ownership of the AI models created from their data.”
Mirco Musolesi, a computer scientist at University College London, said a key benefit of distributed AI training may be unlocking new types of data. “Scaling this to frontier models would allow the AI industry to leverage vast amounts of decentralized and privacy-sensitive data—for example, in healthcare and finance—for training, without facing the risks associated with data centralization,” he said.