Can you earn $400,000 by letting AI do the programming?

2025.02.20

AI has not replaced programmers to the extent that some claim.

2025.02.20 - 09:06:22

Author: Tan Zixin, TopTech

Image source: Generated by Wujie AI

Large language models (LLMs) are transforming the way software is developed, and whether AI can now replace human programmers at scale has become a topic of significant industry interest.

In just two years, large AI models have evolved from solving basic computer science problems to competing with top human performers in international programming contests. For example, a version of OpenAI's o1 entered the 2024 International Olympiad in Informatics (IOI) under the same conditions as human contestants, and OpenAI reported a score above the gold-medal threshold once the limit on submissions per problem was relaxed, demonstrating strong programming potential.

Meanwhile, the pace of model iteration is accelerating. On the code generation benchmark SWE-Bench Verified, GPT-4o scored 33% in August 2024, while the newer o3 model has more than doubled that score to 72%.

To better measure AI models' software engineering capabilities in the real world, OpenAI today open-sourced a new evaluation benchmark called SWE-Lancer, linking model performance directly to monetary value for the first time.

SWE-Lancer is a benchmark containing over 1,400 freelance software engineering tasks sourced from Upwork, with a total real-world compensation value of approximately $1 million. The question it poses is simple: how much money can AI earn by programming?

The "Features" of the New Benchmark

The task prices in SWE-Lancer reflect actual market values—the harder the task, the higher the pay.

It includes both individual engineering tasks and management tasks, in which the model must choose between competing technical implementation proposals. The benchmark therefore targets not only the work of individual programmers but that of an entire development team, including architects and managers.

Compared to previous software engineering benchmarks, SWE-Lancer offers several advantages:

1. All 1,488 tasks represent real payments made by employers to freelance engineers. With rewards ranging from $250 to $32,000, they provide a natural, market-determined difficulty gradient.

Among them, 35% of tasks are valued above $1,000, and 34% fall between $500 and $1,000. The individual contributor (IC) software engineering (SWE) group contains 764 tasks totaling $414,775; the SWE management tasks group includes 724 tasks totaling $585,225.
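As a quick sanity check on the figures above, the two task groups can be tallied in a few lines of Python. The numbers are the ones quoted in this article; the variable names are illustrative only and do not come from the benchmark's code.

```python
# Task counts and total payouts for the two SWE-Lancer groups,
# as quoted in the article (names are illustrative, not from the benchmark).
groups = {
    "IC SWE": {"tasks": 764, "payout_usd": 414_775},
    "SWE management": {"tasks": 724, "payout_usd": 585_225},
}

total_tasks = sum(g["tasks"] for g in groups.values())
total_payout = sum(g["payout_usd"] for g in groups.values())

print(f"Total tasks: {total_tasks}")
print(f"Total payout: ${total_payout:,}")

# Share of the total value and average payout per task for each group.
for name, g in groups.items():
    share = g["payout_usd"] / total_payout
    average = g["payout_usd"] / g["tasks"]
    print(f"{name}: {share:.1%} of total value, about ${average:,.0f} per task")
```

The two groups add up to the 1,488 tasks and $1,000,000 cited above. Management tasks are slightly fewer in number but account for the larger share of the money.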

2. Large-scale software engineering in the real world requires not just coding ability but also technical coordination and management. This benchmark evaluates models in the role of an SWE "technical lead" using real-world data.

3. Advanced full-stack engineering evaluation capability. SWE-Lancer reflects real-world software engineering because its tasks come from platforms serving millions of real users.

The tasks involve mobile and web development, interactions with APIs, browsers, and external applications, as well as verification and reproduction of complex issues.

For instance, the tasks include a $250 reliability improvement (fixing a double-triggered API call), a $1,000 vulnerability fix (resolving permission discrepancies), and a $16,000 new feature (adding in-app video playback support across web, iOS, Android, and desktop).

4. Domain diversity. 74% of IC SWE tasks and 76% of SWE management tasks involve application logic, while 17% of IC SWE tasks and 18% of SWE management tasks relate to UI/UX development.

In terms of difficulty, the tasks selected for SWE-Lancer are highly challenging—tasks in the open-source dataset take an average of 26 days to resolve on GitHub.

In addition, OpenAI states that to ensure unbiased data collection, they selected representative task samples from Upwork and hired 100 professional software engineers to write and validate end-to-end tests for all tasks.

AI Coding Earnings: Performance Comparison

Although many tech leaders frequently claim in public statements that AI models can replace "junior" engineers, whether companies can fully substitute LLMs for human software engineers remains a big question.

The initial evaluation results show that on the full SWE-Lancer dataset, even the top-performing AI models currently tested earn far less than the $1 million potential total reward.

Overall, all models perform better on SWE management tasks than on IC SWE tasks. IC SWE tasks remain largely unsolved by AI models. Among the tested models, the best performer so far is Claude 3.5 Sonnet, developed by OpenAI's competitor Anthropic.

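On IC SWE tasks, every model's single-pass success rate (the share of tasks solved on the first attempt) and return rate (the share of the available payout actually earned) are below 30%. On SWE management tasks, the top-performing model, Claude 3.5 Sonnet, scores 45%.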

Claude 3.5 Sonnet demonstrates strong performance on both IC SWE and SWE management tasks, outperforming the second-best model, o1, by 9.7 percentage points on IC SWE tasks and by 3.4 percentage points on SWE management tasks.

In monetary terms, the top performer, Claude 3.5 Sonnet, earns over $400,000 of the roughly $1 million available across the full dataset.

A notable point is that higher reasoning compute significantly boosts "AI earnings."

In experiments on IC SWE tasks using the o1 model equipped with deep reasoning tools, increased reasoning compute raised the single-pass rate from 9.3% to 16.5%, increasing earnings from $16,000 to $29,000 and the return rate from 6.8% to 12.1%.
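The gap between the single-pass rate and the return rate follows from how the two are defined: the first counts tasks, the second counts dollars. The sketch below illustrates this with a handful of made-up tasks. The payouts, the pass/fail outcomes, and the function names are all hypothetical and are not taken from SWE-Lancer or its evaluation code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    payout_usd: int  # what the freelancer was paid for the task
    passed: bool     # did the model's single attempt pass the tests?


def single_pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved on the first attempt."""
    return sum(r.passed for r in results) / len(results)


def earnings(results: list[TaskResult]) -> int:
    """Total payout of the tasks the model solved."""
    return sum(r.payout_usd for r in results if r.passed)


def return_rate(results: list[TaskResult]) -> float:
    """Fraction of the total available payout that the model earned."""
    return earnings(results) / sum(r.payout_usd for r in results)


# Hypothetical outcomes: the model solves only the two cheapest tasks.
results = [
    TaskResult(250, True),
    TaskResult(500, True),
    TaskResult(1_000, False),
    TaskResult(1_000, False),
    TaskResult(16_000, False),
]

print(f"Single-pass rate: {single_pass_rate(results):.1%}")
print(f"Earnings: ${earnings(results):,}")
print(f"Return rate: {return_rate(results):.1%}")
```

In this toy example the model solves 40% of the tasks but earns only 4% of the money, because the tasks it solves are the cheapest ones. A return rate that sits below the single-pass rate, as in the o1 figures above, would be consistent with the same effect in milder form.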

The researchers concluded that although the best model, Claude 3.5 Sonnet, solved 26.2% of IC SWE problems, most of its remaining solutions still contained errors and would need significant refinement before they could be deployed reliably. o1 ranks second and GPT-4o third, and the single-pass rate on management tasks is typically more than twice that on IC SWE tasks.

This implies that despite the hype around AI agents replacing human software engineers, companies should still proceed with caution. AI models can solve some "basic" coding problems but cannot yet replace "junior" software engineers because they fail to understand why certain code errors occur and often make further cascading mistakes.

The current evaluation framework does not support multimodal inputs, and researchers have not yet assessed "return on investment," such as comparing payment to freelancers versus API usage costs when completing a task—this will be a key focus for future improvements to the benchmark.

Becoming an "AI-Augmented" Programmer

For now, AI still has a long way to go before truly replacing human programmers, since developing a software project involves much more than simply generating code according to specifications.

For example, programmers often face extremely complex, abstract, and ambiguous client requirements, which require deep understanding of various technical principles, business logic, and system architectures. When optimizing complex software architectures, human programmers can comprehensively consider future scalability, maintainability, and performance, whereas AI may struggle to make holistic analytical judgments.

Moreover, programming is not just about implementing existing logic—it requires substantial creativity and innovative thinking. Programmers need to devise new algorithms, design unique software interfaces and interaction methods—areas where AI still falls short.

Programmers also typically need to communicate and collaborate with team members, clients, and other stakeholders, understand diverse needs and feasibility constraints, clearly articulate their own ideas, and work together to complete projects. In addition, human programmers can learn continuously and adapt to change, quickly mastering new knowledge and skills and applying them to real projects, whereas an AI model still needs extensive training and testing before it can do the same.

The software development industry is also subject to various legal and regulatory constraints, such as intellectual property, data protection, and software licensing. AI may struggle to fully comprehend and comply with these legal requirements, potentially introducing legal risks or liability disputes.

In the long run, job displacement due to AI advancements remains possible, but in the short term, "AI-augmented programmers" are the mainstream trend. Mastering the use of the latest AI tools is becoming one of the core competencies of top-tier programmers.
