Robot manipulation benchmark audit

What Are We Actually Benchmarking in Robot Manipulation?

Tianchong Jiang¹, Xiangshan Tan¹,*, Samuel Wheeler³,*, Luzhe Sun¹,*, Tewodros W. Ayalew²,*, Matthew Walter¹

¹ Toyota Technological Institute at Chicago, Chicago, IL, USA
² University of Chicago, Chicago, IL, USA
³ Argonne National Laboratory, Lemont, IL, USA
* Equal contribution

We audit widely used robot manipulation benchmarks with diagnostics for shortcut solvability, statistical significance, creeping overfitting, and data-source dependence.


Cumulative counts of arXiv papers reporting results on CALVIN, LIBERO, SimplerEnv, RoboCasa, and RoboTwin 2.0.

Diagnostics

Shortcut Solvability

A benchmark score is evidence of a capability only if every policy that achieves the score has that capability. On LIBERO, a simple 0.09B-parameter probe with no language encoder and no large-scale robotics pretraining scores at or near the best reported result; on RoboTwin 2.0 and RoboCasa, the same probe scores well below the best reported result.

Statistical Significance

Most benchmarks report only an aggregate success rate, from which the significance of a reported gain cannot be tested. On LIBERO and SimplerEnv, only 19.8% and 19.7% of SOTA claims, respectively, are provably significant. On RoboCasa and RoboTwin 2.0, the shares rise to 53.3% and 73.7%, respectively.
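To illustrate why an aggregate success rate alone is not enough, here is a minimal sketch of a significance check that needs the per-policy episode counts as input. It uses a one-sided Fisher exact test, which is an illustrative choice and not necessarily the test used in the audit. The function name fisher_one_sided_p and all numbers are hypothetical.

```python
from math import comb


def fisher_one_sided_p(succ_new: int, n_new: int, succ_old: int, n_old: int) -> float:
    """One-sided Fisher exact p-value for 'the new policy beats the old one'.

    Under the null hypothesis that both policies share one success rate, the
    number of successes that fall in the new policy's episodes follows a
    hypergeometric distribution. The p-value is the probability of seeing at
    least succ_new successes there by chance.
    """
    total_succ = succ_new + succ_old
    total_eps = n_new + n_old
    denom = comb(total_eps, total_succ)
    max_new = min(n_new, total_succ)
    tail = sum(
        comb(n_new, k) * comb(n_old, total_succ - k)
        for k in range(succ_new, max_new + 1)
    )
    return tail / denom


# Hypothetical example: the same 95% -> 97% gain at two evaluation budgets.
for episodes in (100, 2000):
    old = 95 * episodes // 100
    new = 97 * episodes // 100
    p = fisher_one_sided_p(new, episodes, old, episodes)
    print(f"{episodes} episodes per policy: p = {p:.4f}")
```

Given only the two percentages, neither p-value can be computed. The episode count decides whether the same two-point gain clears a chosen significance threshold.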

Creeping Overfitting

A benchmark's test distribution can be a narrow region of the training distribution's support, and its fixed test set is one sample from that region. Policies can overfit to either one: we call fitting the narrow region distribution overfitting and fitting the fixed test set sample overfitting. On CALVIN, resampling block poses within the training range lowers average tasks completed out of five (ATC) from 4.17 to 3.14.
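For readers unfamiliar with the metric, the sketch below shows one way ATC could be computed. It assumes each evaluation rollout is a chain of five tasks and that counting stops at the first failure, which follows the usual CALVIN long-horizon protocol but is not spelled out here. The function names and the example rollouts are hypothetical.

```python
from typing import Sequence


def tasks_completed_in_a_row(chain: Sequence[bool]) -> int:
    """Count successes from the start of a chain, stopping at the first failure."""
    count = 0
    for succeeded in chain:
        if not succeeded:
            break
        count += 1
    return count


def average_tasks_completed(chains: Sequence[Sequence[bool]]) -> float:
    """Mean number of consecutively completed tasks over all chains (ATC)."""
    if not chains:
        raise ValueError("need at least one chain")
    return sum(tasks_completed_in_a_row(c) for c in chains) / len(chains)


# Hypothetical rollouts: each inner list is one five-task chain.
rollouts = [
    [True, True, True, True, True],     # all five tasks completed
    [True, True, False, True, True],    # counts as 2: later successes are ignored
    [False, False, False, False, False],
]
print(average_tasks_completed(rollouts))

# Relative drop implied by the reported ATC values (4.17 -> 3.14).
relative_drop = (4.17 - 3.14) / 4.17
print(f"relative drop: {relative_drop:.1%}")
```

Under this definition, the reported change from 4.17 to 3.14 is a loss of 1.03 tasks per chain, roughly a quarter of the original score.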

Data-Source Dependence

A small model trained on demonstrations closer to the test distribution can match a much larger model trained on more general data. As a result, a benchmark score alone cannot separate a policy's ability to generalize from the proximity of its training data to the test distribution.

Benchmarks

LIBERO
CALVIN
SimplerEnv
RoboCasa
RoboTwin 2.0