# What Is Arbor and How Does It Stand Out in AI?
Arbor is an open-source framework for autonomous optimization, developed by researchers at Renmin University of China’s Gaoling School of Artificial Intelligence together with Microsoft Research and introduced on June 10, 2026. Across six autonomous optimization tasks, its average relative held-out gain was more than 2.5 times that of both OpenAI’s Codex and Anthropic’s Claude Code. Arbor also achieved the best held-out result on every evaluated task.
# How Does Arbor Function?
Arbor is built on a method called Hypothesis-Tree Refinement (HTR). HTR organizes an optimization task as a branching tree of hypotheses, experiments, evidence, and insights. Each branch builds on the results of earlier trials, a departure from methods that treat each attempt in isolation.
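To make the tree structure concrete, here is a minimal Python sketch of one way such a tree could be represented. It is an illustration under assumptions, not Arbor's actual schema: the `HypothesisNode` class, its field names, and the `inherited_insights` helper are all hypothetical, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HypothesisNode:
    """One node of a hypothesis tree (hypothetical schema, not Arbor's)."""
    hypothesis: str
    experiment: Optional[str] = None   # what was run to test the hypothesis
    evidence: Optional[float] = None   # e.g. a validation score (made up here)
    insight: Optional[str] = None      # lesson passed down to child branches
    children: List["HypothesisNode"] = field(default_factory=list)

    def refine(self, hypothesis: str) -> "HypothesisNode":
        """Create a child hypothesis that builds on this node."""
        child = HypothesisNode(hypothesis)
        self.children.append(child)
        return child


def path_to(root: HypothesisNode, target: HypothesisNode) -> List[HypothesisNode]:
    """Depth-first search for the path from root to target ([] if absent)."""
    if root is target:
        return [root]
    for child in root.children:
        sub = path_to(child, target)
        if sub:
            return [root] + sub
    return []


def inherited_insights(root: HypothesisNode, target: HypothesisNode) -> List[str]:
    """Insights recorded on the ancestors of target, oldest first."""
    return [n.insight for n in path_to(root, target)[:-1] if n.insight]


root = HypothesisNode("Baseline underfits", "train baseline", 0.71,
                      "Model capacity is the bottleneck")
wider = root.refine("A wider model improves validation accuracy")
wider.experiment, wider.evidence = "double hidden size", 0.75
wider.insight = "Wider layers help but overfit late"
dropout = wider.refine("Dropout fixes the late overfitting")
augment = root.refine("Stronger augmentation helps")

print(inherited_insights(root, dropout))
# ['Model capacity is the bottleneck', 'Wider layers help but overfit late']
```

The point of the sketch is that a new branch starts from what its ancestors already learned, while a sibling branch such as `augment` inherits only the root's insight.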
Arbor uses a two-layer architecture. A long-lived coordinator agent handles strategy: it decides which hypotheses to pursue and in what order to run the experiments. Short-lived executor agents carry out those experiments in controlled environments. When an executor finishes a task, it reports its findings back to the coordinator, which then refines its strategy for subsequent rounds.
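The division of labor between the two layers can be sketched as a simple loop. This is a toy illustration under assumptions, not Arbor's implementation: the `Coordinator` class, the `run_executor` function, the `Report` format, and the numeric objective are all hypothetical. A plain function call stands in for an isolated executor environment, and random local search stands in for the coordinator's strategy.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Report:
    """What an executor sends back to the coordinator (hypothetical format)."""
    candidate: float
    score: float


def run_executor(candidate: float, objective: Callable[[float], float]) -> Report:
    """Short-lived worker: run one experiment, report the result, and exit."""
    return Report(candidate=candidate, score=objective(candidate))


class Coordinator:
    """Long-lived planner: keeps state across rounds and decides what to try."""

    def __init__(self, start: float, step: float, seed: int = 0):
        self.best = Report(candidate=start, score=float("-inf"))
        self.step = step
        self.history: List[Report] = []
        self.rng = random.Random(seed)

    def propose(self, n: int) -> List[float]:
        # Branch from the best candidate seen so far instead of starting over.
        return [self.best.candidate + self.rng.uniform(-self.step, self.step)
                for _ in range(n)]

    def update(self, reports: List[Report]) -> None:
        self.history.extend(reports)
        top = max(reports, key=lambda r: r.score)
        if top.score > self.best.score:
            self.best = top       # a branch paid off: build on it next round
        else:
            self.step *= 0.5      # no gain this round: narrow the search


def optimize(objective: Callable[[float], float],
             rounds: int = 20, width: int = 4) -> Coordinator:
    coord = Coordinator(start=0.0, step=1.0)
    for _ in range(rounds):
        reports = [run_executor(c, objective) for c in coord.propose(width)]
        coord.update(reports)
    return coord


def objective(x: float) -> float:
    """Toy objective with its maximum at x = 3."""
    return -(x - 3.0) ** 2


coord = optimize(objective)
print(f"best candidate {coord.best.candidate:.3f}, score {coord.best.score:.4f}")
```

Only the coordinator holds state between rounds. Each executor call sees a single candidate and returns a single report, which mirrors the report-then-refine cycle described above.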
# What Are the Numbers Behind Arbor's Performance?
The benchmark covers six autonomous optimization tasks, including model training and data synthesis. Across them, Arbor’s average relative held-out gain was more than 2.5 times that of both Codex and Claude Code, and it had the best held-out test result on every task.
On MLE-Bench Lite, a recognized benchmark for machine learning engineering, Arbor running on GPT-5.5 achieved an Any-Medal score of 86.36%. This score is the percentage of tasks on which Arbor earned at least a bronze medal.
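As a rough check on what an Any-Medal percentage means, the calculation can be sketched as below. The counts are assumptions: the article does not say how many tasks were evaluated or how many earned medals. The sketch assumes 22 tasks, which is our understanding of the size of the Lite split, and 19 medals, one count that reproduces the reported figure. The `any_medal_rate` helper is hypothetical, and the medal tiers are placeholders.

```python
from typing import List, Optional

MEDALS = {"bronze", "silver", "gold"}


def any_medal_rate(outcomes: List[Optional[str]]) -> float:
    """Percentage of tasks whose outcome is a medal of any tier."""
    earned = sum(1 for outcome in outcomes if outcome in MEDALS)
    return 100.0 * earned / len(outcomes)


# Assumed counts, not reported in the article: 19 medals across 22 tasks.
# The tier ("bronze") is a placeholder; any medal counts the same.
outcomes = ["bronze"] * 19 + [None] * 3

print(round(any_medal_rate(outcomes), 2))  # 86.36
```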
On BrowseComp, Arbor reached an accuracy score of 67.67, compared with 53.33 for Claude Code.
Arbor is publicly available on GitHub at RUC-NLPIR/Arbor. It ships with a command-line interface (CLI) runtime and a set of skills designed for integration with other coding agents.