Manipulation Benchmark Audit

What Are We Actually Benchmarking in Robot Manipulation?

Tianchong Jiang, Xiangshan Tan, Samuel Wheeler, Luzhe Sun, Tewodros W. Ayalew, Matthew Walter

TTIC, University of Chicago, Argonne National Laboratory

We audit widely used robot manipulation benchmarks with diagnostics for shortcut solvability, statistical significance, creeping overfitting, and data-source dependence.

Cumulative counts of arXiv papers reporting results on CALVIN, LIBERO, SimplerEnv, RoboCasa, and RoboTwin 2.0.

Diagnostics

Shortcut Solvability

Paper result figure for the shortcut solvability diagnostic.

A benchmark score is evidence of a capability only if every policy that achieves the score has that capability. On LIBERO, a simple 0.09B-parameter probe with no language encoder and no large-scale robotics pretraining scores at or near the best reported result, whereas on RoboTwin 2.0 and RoboCasa it scores well below the best reported result.

Statistical Significance

Paper result figure for the statistical significance diagnostic.

Most benchmarks report only an aggregate success rate, from which the significance of a reported gain cannot be tested. On LIBERO and SimplerEnv, only 19.8% and 19.7% of SOTA claims, respectively, are provably significant. On RoboCasa and RoboTwin 2.0, the shares rise to 53.3% and 73.7%, respectively.
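
To illustrate why an aggregate success rate alone is not enough, the sketch below runs a one-sided pooled two-proportion z-test on two success rates. This is only one common choice of test and is not necessarily the procedure used in the paper; the function name and the trial counts are hypothetical. Note that the test cannot be computed without the number of evaluation trials behind each rate.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """One-sided pooled z-test of H1: method B's success rate exceeds method A's.

    Returns (z, p_value). All inputs are raw counts, which is exactly the
    information missing when a paper reports only an aggregate success rate.
    """
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0.0:
        # Both methods succeeded (or failed) on every trial: no evidence of a gap.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical example: the same 94% -> 96% gain at two evaluation sizes.
z_small, p_small = two_proportion_z_test(470, 500, 480, 500)
z_large, p_large = two_proportion_z_test(4700, 5000, 4800, 5000)
print(f"500 trials each:  z={z_small:.2f}, p={p_small:.4f}")
print(f"5000 trials each: z={z_large:.2f}, p={p_large:.2e}")
```

In this hypothetical example, the two-point gain does not reach the 0.05 level with 500 trials per method but does with 5000, so the same reported improvement can be significant or not depending on the evaluation size.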

Creeping Overfitting

Paper result figure for the creeping overfitting diagnostic.

A benchmark's test distribution can be a narrow region of the training distribution's support, and its fixed test set is a single sample from that region. Policies can overfit to either one; we call these distribution overfitting and sample overfitting, respectively. On CALVIN, resampling block poses within the training range lowers the average number of tasks completed out of five (ATC) from 4.17 to 3.14.
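
For readers unfamiliar with the metric, the sketch below shows one way ATC could be computed from per-rollout outcomes. It assumes that each rollout is a chain of five tasks attempted in order and that a chain stops counting at the first failure; the function name and the example data are hypothetical.

```python
def average_tasks_completed(rollouts):
    """Mean number of tasks completed in a row from the start of each chain.

    `rollouts` is a list of chains; each chain is a list of booleans giving
    the outcome of each task in order. Assumption: tasks after the first
    failure do not count, even if they are marked as successful.
    """
    if not rollouts:
        raise ValueError("need at least one rollout")
    total = 0
    for chain in rollouts:
        completed = 0
        for success in chain:
            if not success:
                break  # the chain ends at the first failed task
            completed += 1
        total += completed
    return total / len(rollouts)


# Hypothetical outcomes for four five-task chains.
example = [
    [True, True, True, True, True],      # 5 tasks completed
    [True, True, False, False, False],   # 2 tasks completed
    [True, False, True, True, True],     # 1 task completed
    [False, False, False, False, False], # 0 tasks completed
]
print(average_tasks_completed(example))  # (5 + 2 + 1 + 0) / 4 = 2.0
```

Under this definition, the reported drop from 4.17 to 3.14 is 1.03 tasks per chain, roughly a quarter of the original score.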

Data-Source Dependence

Paper result figure for the data-source dependence diagnostic.

A small model trained on demonstrations closer to the test distribution can match a much larger model trained on more general data. As a result, a benchmark score alone cannot separate generalization from the proximity of the training data to the test distribution.

Benchmarks

  • LIBERO
  • CALVIN
  • SimplerEnv
  • RoboCasa
  • RoboTwin 2.0