February 15, 2026

AI Still Can’t Do Most Real Jobs

A new large-scale study from researchers at the Center for AI Safety and Scale AI introduces the Remote Labor Index (RLI), a benchmark designed to measure how well AI systems can complete real, economically valuable remote work. Unlike traditional benchmarks based on puzzles or short prompts, this one evaluates full projects similar to freelance jobs on platforms like Upwork.

The results are sobering: frontier AI agents successfully completed only a tiny fraction of real-world projects to a client-acceptable standard. Despite impressive performance on reasoning benchmarks, current AI systems remain far from automating most knowledge work.

How This Study Was Designed

Most AI benchmarks test isolated abilities: answering questions, writing short code snippets, or solving structured problems. But real jobs require sustained execution: handling messy data, unclear instructions, awkward file formats, and iterative requirements.

The researchers built the Remote Labor Index, a benchmark made from hundreds of real freelance projects spanning fields like software, design, architecture, and data analysis. Each project had measurable economic value and realistic completion times, often requiring many hours of work.

AI agents were tasked with completing these projects end-to-end. Outputs were judged by whether a client would reasonably accept the work, not whether it looked plausible.

How Well AI Actually Performed

The headline finding: modern AI systems perform near the floor on real work.

The best agent achieved an automation rate of only about 2.5%. In other words, over 97% of real freelance projects could not be completed to acceptable quality by AI alone.
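The automation rate here is simply the share of projects completed to a client-acceptable standard. A minimal sketch of that arithmetic (the project counts below are hypothetical, chosen only to reproduce a rate of roughly 2.5%; they are not the study's actual figures):

```python
def automation_rate(accepted: int, total: int) -> float:
    """Fraction of projects completed to a client-acceptable standard."""
    return accepted / total

# Hypothetical counts: 6 accepted out of 240 projects.
rate = automation_rate(6, 240)
print(f"automation rate: {rate:.1%}")   # prints "automation rate: 2.5%"
print(f"not automated:   {1 - rate:.1%}")  # prints "not automated:   97.5%"
```

The complement of the automation rate is what the article's "over 97%" figure refers to: the fraction of projects AI alone could not deliver acceptably.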

This contrasts sharply with near-saturated performance on academic and coding benchmarks. The gap suggests current evaluations dramatically overestimate real economic capability.

Why AI Failed

Failures were rarely due to lack of raw knowledge. Instead, they came from practical execution problems: the kinds humans solve constantly during real work.

Common failure modes included:

  • incomplete project execution,
  • broken or incompatible files,
  • inconsistent outputs across steps,
  • poor handling of ambiguous instructions,
  • inability to recover from small mistakes.

In short, the challenge was not intelligence in isolation, but reliable agency over long tasks. Real work requires coordination, persistence, and correction, areas where AI agents still struggle.

What This Means for Automation

The findings suggest the near-term impact of AI is more likely augmentation than replacement.

Benchmarks showing superhuman reasoning do not automatically translate into economic automation. Even highly capable models fail when asked to operate continuously in open-ended environments.

This has two major implications:

  • Many jobs are safer than headline benchmarks imply,
  • The main value of AI today is accelerating humans, not replacing them.

The study therefore reframes the automation debate: the bottleneck is not knowledge, but reliable execution in messy real contexts.

Important Caveats

The benchmark measures autonomous agents working independently. Human-AI collaboration may achieve far higher performance.

Additionally, progress is measurable: automation rates, though low today, have been improving. The study does not claim AI cannot automate work, only that current capabilities are far from full replacement.

The Remote Labor Index is best understood as a baseline: a realistic measurement of where AI stands in the real economy today, not a prediction of where it will be tomorrow.

Read the full paper here