What Are We Actually Benchmarking in Robot Manipulation?
Tianchong Jiang, Xiangshan Tan, Samuel Wheeler, Luzhe Sun, Tewodros W. Ayalew, Matthew Walter
TTIC, University of Chicago, Argonne National Laboratory
We audit widely used robot manipulation benchmarks with diagnostics for shortcut solvability, statistical significance, creeping overfitting, and data-source dependence.
Cumulative counts of arXiv papers reporting results on CALVIN, LIBERO, SimplerEnv, RoboCasa, and RoboTwin 2.0.
Diagnostics
Shortcut Solvability
A benchmark score is evidence of a capability only if every policy that achieves the score has that capability. On LIBERO, a simple 0.09B-parameter probe with no language encoder and no large-scale robotics pretraining scores at or near the best reported result, while on RoboTwin 2.0 and RoboCasa it scores well below the best reported result.
Statistical Significance
Most benchmarks report only an aggregate success rate, from which the significance of a reported gain cannot be tested. On LIBERO and SimplerEnv, only 19.8% and 19.7% of SOTA claims, respectively, are provably significant. On RoboCasa and RoboTwin 2.0, the shares rise to 53.3% and 73.7%, respectively.
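As an illustration of what testing a reported gain involves, the sketch below runs a standard two-sided two-proportion z-test on two success rates. It is not necessarily the test used in this audit. It assumes the number of evaluation trials behind each rate is known and that trials are independent; the function name and the counts in the example are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for the difference between two success rates.

    Assumption: each trial is an independent Bernoulli outcome and the
    trial counts are known. Returns (z statistic, two-sided p-value).
    """
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if se == 0.0:
        # Both policies succeeded (or failed) on every trial: no evidence of a gap.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 96.0% vs 94.0% success over 500 trials each.
z_small, p_small = two_proportion_z_test(480, 500, 470, 500)
# Hypothetical counts: 96.0% vs 86.0% success over 500 trials each.
z_large, p_large = two_proportion_z_test(480, 500, 430, 500)
print(f"2-point gap:  z={z_small:.2f}, p={p_small:.3f}")
print(f"10-point gap: z={z_large:.2f}, p={p_large:.2e}")
```

With these hypothetical counts, the two-point gap does not reach the 0.05 level while the ten-point gap does. Without the trial counts, neither conclusion can be drawn from the aggregate rates alone.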
Creeping Overfitting
A benchmark's test distribution can be a narrow region of the training distribution's support, and its fixed test set is one sample from that region. A policy can overfit to either: we call fitting the narrow region distribution overfitting and fitting the fixed test set sample overfitting. On CALVIN, resampling block poses within the training range lowers the average number of tasks completed out of five (ATC) from 4.17 to 3.14.
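For readers unfamiliar with the metric, the sketch below shows one way ATC can be computed. It assumes the usual CALVIN long-horizon protocol, in which each evaluation is a chain of five tasks and a rollout stops at the first failure. The function name and the toy outcomes are hypothetical.

```python
def average_tasks_completed(chains):
    """Mean number of consecutive tasks completed per evaluation chain.

    Assumption: each chain is a sequence of per-task success flags in
    execution order, and tasks after the first failure do not count.
    """
    if not chains:
        raise ValueError("at least one evaluation chain is required")
    total = 0
    for chain in chains:
        for succeeded in chain:
            if not succeeded:
                break  # the rollout ends at the first failed task
            total += 1
    return total / len(chains)


# Toy example with three hypothetical chains of five tasks each.
toy_chains = [
    [True, True, True, True, True],      # completes all five tasks
    [True, True, False, True, True],     # counts two; later flags are ignored
    [False, False, False, False, False], # counts zero
]
print(average_tasks_completed(toy_chains))
```

Under this definition a single early failure removes every later task in the chain from the count.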
Data-Source Dependence
A small model trained on demonstrations closer to the test distribution can match a much larger model trained on more general data. As a result, a benchmark score alone cannot separate a policy's generalization from the proximity of its training data to the test distribution.