Insights from UC Berkeley's Benchmark on AI Agent Performance

By Patricia Miller

Jun 11, 2026

UC Berkeley's benchmark highlights low AI agent performance, with an average pass rate of 2.6% on challenging tasks.

#What Does UC Berkeley's New Benchmark Reveal About AI Agent Performance?

UC Berkeley's latest benchmark calls for a reassessment of AI agent capabilities. The evaluation framework, named The Agents’ Last Exam, was built with contributions from over 250 experts at more than 100 institutions. Initial results show that mainstream AI agents pass only 2.6% of the most challenging tasks on average, all of them drawn from real-world professional settings. By contrast, the best-performing agent, Codex running on gpt-5-5, achieved a pass rate of roughly 26%.

The benchmark evaluates competencies across 55 non-physical sub-industries grouped into 13 clusters, based on the O*NET/SOC 2018 taxonomy. The current catalog includes over 1,500 tasks, with plans to expand it to 5,000. A key aspect of the evaluation is its focus on verifiable outcomes, so an agent cannot earn credit for the kind of misleading output that has been common among large language models.

The research paper detailing these findings was submitted to arXiv recently, and the benchmark can be accessed at agents-last-exam.org. It is designed to evolve continuously, growing in scope and complexity.

#How Was This Benchmark Created?

The Agents’ Last Exam project emerged from a collaborative effort led by UC Berkeley's RDI. Participating organizations include MIT, Harvard, Stanford, Goldman Sachs, JPMorgan, Meta, Amazon, Adobe, and Snorkel AI.

#Why Is a 26% Top Score Significant?

Codex's top pass rate of roughly 26% shows that even the highest-performing AI agents struggle on complex real-world tasks. Many agents can answer isolated questions correctly, but they often falter on multi-step workflows that require context retention, sequential decision-making, and verified deliverables. The benchmark prioritizes performance on long-running tasks over quick responses, which gives a more accurate picture of an AI agent’s true capabilities.
