AI agents still struggle badly with real-world work

A new benchmark finds office automation may be further off than hype would have you believe.

November 6, 2025


Like many human workers, AI agents might struggle to make a living off online gig work.

For these autonomous AI systems, however, it’s not for lack of work, but because they overwhelmingly fail to complete projects sourced from online freelance platforms.

That’s what researchers at the Center for AI Safety (CAIS) and Scale AI found when they tested agents on their ability to perform economically valuable work as part of a new benchmark, the Remote Labor Index (RLI). (The “highest-performing agent” managed a piddly automation rate of just 2.5%.)

Plenty of existing benchmarks measure specific skills like coding or basic office tasks, and agents have come to master those. But the creators of the RLI wanted to see how those abilities translate to real-world settings, where projects tend to involve more complex and more varied work.

The authors said their findings should serve to ground some of the more provocative claims around AI-fueled automation replacing human workers. Anthropic CEO Dario Amodei, for instance, has predicted that AI could wipe out half of all entry-level white-collar jobs in the next one to five years.

“Currently, these agents can’t automate people’s jobs, but AI development moves fast, so things could look very different in five years,” Mantas Mazeika, lead researcher at CAIS, said in an email. “We hope RLI can help provide clarity here and enable policymakers and the public to proactively navigate AI-driven labor automation.”

The benchmark consists of 240 end-to-end freelance projects across categories like product and graphic design, game development, and architecture, as well as successful examples of completed projects from human professionals. The projects are sourced directly from online freelance platforms.

Even the best agent tested (from the Chinese company Manus) was only able to earn $1,720 of the available $143,991 in freelance payment on offer, followed by Anthropic’s Sonnet 4.5, with $1,280, and GPT-5, “earning” $1,180.
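
For a rough sense of scale, those dollar figures can be converted into shares of the total payout. The short Python sketch below does only that arithmetic. The names `TOTAL_PAYOUT`, `EARNINGS`, and `share_of_payout` are illustrative and do not come from the paper, and the result is a different measure from the 2.5% automation rate cited above, which this sketch does not attempt to reproduce.

```python
# Dollar figures as reported in the article; names are illustrative only.
TOTAL_PAYOUT = 143_991  # total freelance payment on offer, in dollars

EARNINGS = {
    "Manus": 1_720,
    "Sonnet 4.5": 1_280,
    "GPT-5": 1_180,
}


def share_of_payout(earned: float, total: float = TOTAL_PAYOUT) -> float:
    """Return the earned amount as a percentage of the total payout."""
    return 100 * earned / total


if __name__ == "__main__":
    for agent, earned in EARNINGS.items():
        # Format each share to two decimal places.
        print(f"{agent}: ${earned:,} is {share_of_payout(earned):.2f}% of the total")
```

By this measure, even the top earner collected a little over 1% of the money on the table.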

Agents would submit work in corrupt or empty files, fail to complete pieces of the instructions, or provide shoddy or inconsistent material, according to the paper.

“There are areas that today’s agents can do really well. Those are…creative tasks like writing a report, solving software problems, or doing some data synthesis integrations,” Bing Liu, Scale AI’s director of research, told Tech Brew. “But for the real-world tasks that require complex interaction with tools, understanding ambiguous requirements in briefs, and completing [long-running] tasks…those types of tasks still challenge most of these frontier models today.”

The findings contrast sharply with the ongoing hype around agents in the office. Companies like Microsoft and Salesforce have painted a picture of a future in which specialized agents tackle most day-to-day work tasks, with human employees taking on more of a supervisory role.

Liu said that while these agents might excel at benchmarks devised specifically to test their abilities, real-world use tends to present more varied and complex challenges than those tests do. That real-world setting is what the team is trying to capture with this new index.

“Synthetic tasks, most of the time, are created for the sake of challenging models and agents in a certain way for specific capabilities, but real-world tasks are a lot more different than that,” Liu said. “The implication to business leaders is that we should continue to monitor and measure AI's progress in these real-world tasks, the real workflows that they truly care about that bring real economic value. And we have a long way there.”
