
Can AI Survive in the Crypto World: Testing 18 Large Models
The establishment of benchmarking could become a key bridge connecting the AI and crypto domains, catalyzing innovation and providing clear guidance for future applications.
Author: Wang Chao
In the annals of technological progress, revolutionary technologies often emerge independently, each driving transformative change in its era. Yet when two such disruptive forces converge, their collision can yield exponential impact. We are now at precisely such a historic juncture: artificial intelligence and cryptography—two equally transformative technologies—are stepping together into the spotlight.
We envision cryptographic techniques solving many of AI's core challenges; we anticipate AI agents building autonomous economic networks that drive mass adoption of crypto; and we hope AI will accelerate the evolution of existing crypto use cases. Intense attention and massive capital have poured into this intersection, and, like any buzzword, it embodies humanity’s yearning for innovation and aspirations for the future, as well as unchecked ambition and greed.
Yet amid this noise, we remain profoundly ignorant of the most fundamental questions: How well does AI truly understand the crypto domain? Do LLM-powered agents possess practical ability to use cryptographic tools? How large are the performance gaps between different models on crypto-related tasks?
The answers to these questions will determine the mutual influence between AI and crypto, and are critical for product directions and technical roadmap decisions in this interdisciplinary space. To explore them, I conducted a series of evaluation experiments on large language models. By assessing their knowledge and capabilities in the crypto domain, I aimed to gauge the maturity of AI for crypto applications and evaluate the potential and challenges of integrating AI with crypto.
Key Takeaways
Large language models excel in cryptography and blockchain fundamentals and demonstrate solid understanding of the crypto ecosystem, but perform poorly in mathematical computation and complex business logic analysis. Models show satisfactory baseline competence in private key handling and basic wallet operations, yet face serious challenges regarding secure cloud-based private key storage. Many models can generate valid smart contract code for simple scenarios, but cannot independently perform high-difficulty tasks like contract audits or creating complex contracts.
Commercial closed-source models lead overall, with only Llama 3.1-405B standing out among open-source models—smaller open-source models universally underperform. However, there is clear potential: through prompt engineering, chain-of-thought reasoning, and few-shot learning, all models showed significant performance gains. Leading models already demonstrate strong technical feasibility in certain vertical application scenarios.
Experiment Details
I selected 18 representative language models for evaluation:
- Closed-source models: GPT-4o, GPT-4o Mini, Claude 3.5 Sonnet, Gemini 1.5 Pro, Grok2 beta (currently closed-source)
- Open-source models: Llama 3.1 8B/70B/405B, Mistral Nemo 12B, DeepSeek-coder-v2, Nous-hermes2, Phi3 3.8B/14B, Gemma2 9B/27B, Command-R
- Math-optimized models: Qwen2-math-72B, MathΣtral
These span mainstream commercial and popular open-source models, with parameter counts ranging from 3.8B to 405B—a hundredfold difference. Given crypto’s close ties to mathematics, two math-specialized models were included.
The evaluation covered domains including cryptography, blockchain fundamentals, private keys and wallet operations, smart contracts, DAOs and governance, consensus and economic models, DApps/DeFi/NFTs, and on-chain data analysis. Each domain consisted of a series of progressively difficult questions and tasks designed not only to test knowledge but also real-world applicability via simulation.
Task design was diverse: partly informed by input from multiple crypto experts, partly AI-assisted and manually reviewed to ensure accuracy and challenge level. Some tasks used simple multiple-choice formats for standardized automated scoring. Others employed more complex formats evaluated via a hybrid approach combining automation, human review, and AI assistance. All assessments used zero-shot inference—no examples, reasoning prompts, or instructional cues were provided.
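To make that distinction concrete, the sketch below contrasts a zero-shot prompt of the kind used here with a few-shot, chain-of-thought variant that was deliberately not used in this round; the question and numbers are illustrative placeholders, not items from the actual test set.

```python
# Illustrative prompt formats only; the real test items are not published here.

# Zero-shot: the question alone, with no examples or reasoning cues
# (the format used throughout this experiment).
zero_shot_prompt = (
    "A transaction consumes 21,000 gas at a gas price of 30 gwei. "
    "What is the total fee in ETH?"
)

# Few-shot + chain-of-thought: a worked example plus an explicit reasoning cue
# (not used in this experiment, but a likely source of the gains mentioned above).
few_shot_cot_prompt = (
    "Example: 50,000 gas at 20 gwei costs 50,000 * 20 = 1,000,000 gwei = 0.001 ETH.\n"
    "Now reason step by step: a transaction consumes 21,000 gas at a gas price of "
    "30 gwei. What is the total fee in ETH?"
)
```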
The experiment remains rough and falls short of academic rigor: the question set is far from comprehensive, and the testing framework is still immature. For that reason, this article omits specific scores and focuses instead on key insights.
Knowledge / Concepts
During evaluation, large language models performed exceptionally well on foundational knowledge tests across cryptography, blockchain basics, and DeFi applications. For instance, on an open-ended question about data availability, all models provided accurate answers. On Ethereum transaction structure, despite minor differences in detail, all models conveyed correct key information. Multiple-choice concept checks posed no difficulty—nearly all models scored over 95% accuracy.
Conceptual Q&A poses no real challenge to large models.
Computation / Business Logic
However, performance flipped dramatically on tasks requiring actual calculation. A simple RSA computation problem stumped most models. This is understandable: LLMs operate primarily by recognizing and replicating patterns in training data, rather than deeply comprehending mathematical principles. This limitation becomes especially apparent with abstract math concepts like modular arithmetic and exponentiation. Given crypto’s heavy reliance on mathematics, this implies direct reliance on models for crypto-related mathematical calculations is unreliable.
Performance on other computational tasks was similarly disappointing. For a straightforward AMM impermanent loss calculation—requiring no advanced math—only 4 out of 18 models produced correct answers. Even a basic block probability calculation question was answered incorrectly by every single model. This exposes not only weaknesses in precise computation but also major shortcomings in business logic reasoning. Notably, even math-optimized models failed to show clear advantages here—their performance was underwhelming.
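For reference, the impermanent-loss question has a well-known closed-form answer for a 50/50 constant-product pool. The sketch below uses the standard textbook formula; the exact question and numbers from the test set are not reproduced here.

```python
import math

def impermanent_loss(price_ratio: float) -> float:
    """Standard impermanent-loss formula for a 50/50 constant-product AMM:
    the value of the LP position relative to simply holding, minus one."""
    return 2 * math.sqrt(price_ratio) / (1 + price_ratio) - 1

# If one asset's price doubles relative to the other (price_ratio = 2),
# the LP position is worth about 5.7% less than just holding the assets.
print(f"{impermanent_loss(2.0):.2%}")  # -5.72%
```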
However, the math problem isn't insurmountable. When asked to output Python code instead of direct results, accuracy improves significantly. For the RSA example, most models generated executable code that correctly solved the problem. In production environments, providing pre-defined algorithmic libraries can bypass the need for models to compute directly—mirroring how humans handle such tasks. On the business logic side, carefully engineered prompts can also substantially improve model performance.
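As a concrete illustration of the "ask for code, not the answer" workaround, the snippet below is the kind of script a capable model can produce for a toy RSA problem. The parameters are the classic textbook example, not the actual question from the test set.

```python
# Toy RSA with textbook parameters; illustrative only, far too small to be secure.
p, q, e = 61, 53, 17           # small primes and a public exponent
n = p * q                      # modulus: 3233
phi = (p - 1) * (q - 1)        # Euler's totient: 3120
d = pow(e, -1, phi)            # private exponent via modular inverse (Python 3.8+): 2753

message = 65
ciphertext = pow(message, e, n)    # encryption: m^e mod n = 2790
recovered = pow(ciphertext, d, n)  # decryption: c^d mod n
assert recovered == message
print(ciphertext, recovered)       # 2790 65
```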
Private Key Management & Wallet Operations
If asked what the first crypto use case for an agent might be, my answer would be payments. Cryptocurrency could be seen as natively AI-compatible money. Compared to the numerous barriers agents face in traditional finance, using crypto to establish digital identity and manage funds via crypto wallets is a natural fit. Therefore, private key generation and management, along with wallet operations, constitute the most fundamental skills for agent autonomy on crypto networks.
Secure private key generation hinges on high-quality randomness—a capability clearly beyond large language models. Nevertheless, models show adequate awareness of private key security. When prompted to generate a key, most respond by offering code (e.g., using Python libraries) to guide users in generating their own keys locally. Even models that directly output a key typically clarify it’s for demonstration only and not suitable for real use. In this regard, all large models demonstrated acceptable behavior.
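A typical "safe" response looks like the sketch below: code that runs entirely on the user's machine using only the Python standard library. Production wallets would add mnemonic and hierarchical key derivation (BIP-39/BIP-32) and secure storage, which are beyond this illustration.

```python
import secrets

# Draw 32 bytes of cryptographically secure randomness locally.
# This is the raw material for a private key; a real wallet would derive it
# from a BIP-39 mnemonic, keep it in secure storage, and never print it or
# send it to a remotely hosted model.
private_key = secrets.token_bytes(32)
print(private_key.hex())
```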
Private key management presents greater challenges—not due to model capability, but inherent architectural constraints. With locally deployed models, generated keys can be considered relatively secure. But with commercial cloud-hosted models, we must assume the key is exposed to the provider the moment it's created. Yet for agents aiming for independent operation, possessing private key authority is essential—meaning keys cannot reside solely on user devices. In such cases, relying solely on the model is insufficient. Trusted execution environments (TEEs) or hardware security modules (HSMs) must be introduced as additional safeguards.
Assuming an agent securely holds a private key, all tested models demonstrated strong capability in performing basic operations. While the steps and code they produce often contain errors, these issues are largely addressable within proper engineering architectures. From a technical standpoint, enabling agents to autonomously perform basic wallet operations faces few remaining obstacles.
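As a rough sketch of what "basic wallet operations" means in practice, the snippet below assumes the third-party web3.py library and access to a JSON-RPC endpoint; the endpoint URL and addresses are placeholders, and the signing step is shown only as a comment because, per the discussion above, it should happen inside a TEE or HSM rather than anywhere the model or its provider can see the key.

```python
from web3 import Web3  # assumes the web3.py library is installed

w3 = Web3(Web3.HTTPProvider("https://rpc.example.org"))  # placeholder RPC endpoint

sender = "0x0000000000000000000000000000000000000001"    # placeholder addresses
recipient = "0x0000000000000000000000000000000000000002"

# Read on-chain state: balance and nonce.
balance = w3.eth.get_balance(sender)
nonce = w3.eth.get_transaction_count(sender)

# Assemble a simple value transfer.
tx = {
    "to": recipient,
    "value": w3.to_wei(0.01, "ether"),
    "gas": 21_000,
    "gasPrice": w3.eth.gas_price,
    "nonce": nonce,
    "chainId": 1,
}

# Signing and broadcasting belong inside the trusted key holder, not the model:
# signed = w3.eth.account.sign_transaction(tx, private_key=PRIVATE_KEY)
# w3.eth.send_raw_transaction(signed.raw_transaction)  # attribute name differs across web3.py versions
```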
Smart Contracts
An agent’s ability to understand, use, write, and identify risks in smart contracts is crucial for executing complex tasks on-chain—making this a focal point of the experiment. Large language models show clear potential here, but also reveal notable limitations.
Nearly all models correctly answered basic contract concept questions and identified simple bugs. On gas optimization, most could pinpoint key improvement areas and analyze potential trade-offs. However, limitations emerged when deeper business logic was involved.
Take a token vesting contract: all models correctly understood its purpose, and most found several medium-to-low-risk vulnerabilities. But none detected a high-risk flaw hidden in the business logic—one that could lock up funds under specific conditions. Similar results appeared across multiple real-contract tests.
This suggests models’ understanding of contracts remains superficial, lacking deep grasp of underlying business logic. That said, with additional prompting, some models eventually identified the deeply buried vulnerability. Based on this, we can conclude that—with good engineering support—large models are already capable of serving as effective co-pilots in smart contract development. However, fully autonomous performance of critical tasks like contract auditing remains distant.
One clarification: the code-related tasks in the experiment focused on contracts of fewer than 2,000 lines with relatively simple logic. Larger, more complex projects clearly exceed current models’ capabilities without fine-tuning or sophisticated prompt engineering, so they were excluded. Additionally, only Solidity was tested; other languages such as Rust and Move were not covered.
Beyond these, the experiment also touched on DeFi scenarios, DAOs and governance, on-chain analytics, consensus design, and tokenomics. Models showed varying degrees of capability across these areas. As many tests are ongoing and frameworks continue to evolve, I won’t delve deeper here.
Model Differences
Among all evaluated models, GPT-4o and Claude 3.5 Sonnet maintained the dominance they show in other domains and were the clear leaders here as well. They consistently delivered accurate answers to basic questions and provided deep, well-reasoned analyses in complex scenarios. Even in computation—a weak spot for LLMs—they achieved relatively high success rates, though “high” is relative and still falls short of production-grade reliability.
Among open-source models, Llama 3.1-405B pulled far ahead thanks to its massive scale and advanced algorithms. Smaller open-source models showed no significant performance differentiation—despite slight score variations, all fell far below passing thresholds.
Thus, for building crypto-focused AI applications today, smaller-parameter open-source models are not viable choices.
Two models stood out in particular. First, Microsoft’s Phi-3 3.8B—the smallest in the test—achieved performance on par with 8B–12B models using less than half the parameters, excelling in certain categories. This highlights the importance of model architecture optimization and training strategies beyond mere parameter scaling.
Second, Cohere’s Command-R proved a surprising “dark horse” in the opposite direction. Though less widely known, Cohere is an enterprise-focused (B2B) LLM company whose profile seemed well aligned with agent development, which prompted its inclusion. Yet the 35B-parameter Command-R ranked near the bottom across most tests, outperformed even by sub-10B models.
This raises questions: Command-R emphasized retrieval-augmented generation (RAG) at launch and didn’t publish standard benchmark scores. Is it a “specialized key” that only unlocks full potential in narrow contexts?
Limitations of the Experiment
These tests offered initial insights into AI’s capabilities in crypto. However, they fall far short of professional standards. Dataset coverage is limited, quantification criteria are coarse, and a refined, accurate scoring mechanism is lacking—all affecting result precision and risking underestimation of certain models.
Methodologically, only zero-shot learning was used, without exploring chain-of-thought prompting or few-shot learning, which could unlock greater model potential. Standard model configurations were used throughout, without studying the impact of hyperparameter tuning. These uniform, simplistic methods limit our ability to fully assess model potential or detect performance variations under optimized conditions.
Despite these limitations, the experiment yielded valuable insights useful for developers building real-world applications.
The Crypto Domain Needs Its Own Benchmark
In AI, benchmarks play a pivotal role. The rapid advancement of modern deep learning can be traced back to ImageNet, the standardized computer-vision dataset and benchmark built by Professor Fei-Fei Li’s team, and to the breakthrough results achieved on it in 2012.
By providing unified evaluation standards, benchmarks give developers clear goals and reference points, accelerating industry-wide progress. This explains why every newly released LLM prominently reports its benchmark scores. These results serve as a “common language” of model capability—helping researchers identify breakthroughs, developers select optimal models for tasks, and users make informed decisions. More importantly, benchmarks often signal future directions for AI applications, guiding investment and research focus.
If we believe the intersection of AI and crypto holds immense potential, establishing a dedicated crypto benchmark becomes an urgent task. Such a benchmark could become the crucial bridge connecting these two fields, catalyzing innovation and providing clear direction for future applications.
Yet compared to mature benchmarks in other domains, building one for crypto faces unique challenges: the technology evolves rapidly, the knowledge base is not yet stabilized, and core directions lack consensus. As an interdisciplinary field, crypto spans cryptography, distributed systems, economics, and more—far more complex than single-domain benchmarks. More challengingly, a crypto benchmark must not only assess knowledge but also evaluate AI’s practical ability to *use* crypto tools—requiring entirely new evaluation architectures. A severe lack of relevant datasets further compounds the difficulty.
The complexity and significance of this task mean it cannot be accomplished by any single individual or team. It requires collective intelligence—from users, developers, cryptographers, crypto researchers, and other cross-disciplinary experts—driven by broad community participation and consensus. Hence, developing a crypto benchmark demands wider discussion, as it is not merely a technical endeavor, but a profound reflection on how we understand this emerging technological frontier.
Postscript: Our conversation is far from over. In upcoming articles, I’ll dive deeper into concrete approaches and challenges in building an AI benchmark for crypto. The experiment continues—models are being refined, datasets expanded, evaluation frameworks improved, and automated testing infrastructure enhanced. Guided by open collaboration, all related resources—including datasets, results, evaluation frameworks, and test code—will be open-sourced as public goods.