Humans naturally obtain intuition about the interactions between and the stability of rigid objects by observing and interacting with the world. It is this intuition that governs the way in which we regularly configure objects in our environment, allowing us to build complex structures from simple, everyday objects. Robotic agents, on the other hand, traditionally require an explicit model of the world that includes the detailed geometry of each object and an analytical model of the environment dynamics, which are difficult to scale and preclude generalization. Instead, robots would benefit from an awareness of intuitive physics that enables them to similarly reason over the stable interaction of objects in their environment.
Towards that goal, we propose StackGen, a diffusion model that generates diverse stable configurations of building blocks matching a target silhouette. To demonstrate the capability of the method, we evaluate it in a simulated environment and deploy it in a real-world setting, using a robotic arm to assemble the structures generated by the model.
StackGen predicts the set of block shapes along with their six-DoF poses sufficient to collectively realize a stable 3D structure consistent with a user-provided silhouette. StackGen uses a transformer-based conditional diffusion model trained on diverse examples of stable 3D structures paired with different forms of conditioning.
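As a rough, hypothetical illustration of how such a model can be sampled at inference time, the following sketch implements standard DDPM ancestral sampling over a set of per-block tokens, conditioned on silhouette tokens. The token layout (a six-DoF pose plus shape logits per block), the denoiser interface, and the linear noise schedule are illustrative assumptions, not the exact architecture used by StackGen:

import torch

@torch.no_grad()
def sample_block_tokens(denoiser, silhouette_tokens, n_blocks, token_dim=17,
                        n_steps=1000, device="cpu"):
    """DDPM-style reverse process over per-block tokens (hypothetical layout:
    a six-DoF pose plus shape logits per token). `denoiser` is assumed to
    predict the added noise given the noisy tokens, the silhouette
    conditioning, and the diffusion timestep."""
    betas = torch.linspace(1e-4, 0.02, n_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, n_blocks, token_dim, device=device)  # start from noise
    for t in reversed(range(n_steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = denoiser(x, silhouette_tokens, t_batch)        # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # decode downstream into block shapes and six-DoF poses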
StackGen uses (below-left) a "construction by deconstruction" strategy to generate a diverse set of stable structures. The procedure first builds an initial structure out of a dense collection of different blocks. While the structure remains stable, we then iteratively remove a randomly chosen block, adding each intermediate stable structure to the dataset. This yields (below-right) a diverse collection of stable stacks.
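The following is a minimal sketch of one plausible implementation of this deconstruction loop. The is_stable callback stands in for a stability check (e.g., settling the structure in a physics simulator and testing whether any block moved); both it and the block representation are assumptions for illustration:

import random

def construction_by_deconstruction(initial_blocks, is_stable):
    """Start from a dense, stable initial structure, then repeatedly remove a
    randomly chosen block, recording every intermediate configuration that
    remains stable."""
    dataset = [list(initial_blocks)]  # the initial structure is stable by construction
    blocks = list(initial_blocks)
    while len(blocks) > 1:
        candidate = random.choice(blocks)
        remaining = [b for b in blocks if b is not candidate]
        if not is_stable(remaining):
            break  # removing this block destabilizes the stack; stop here
        blocks = remaining
        dataset.append(list(blocks))
    return dataset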
The following is a visualization of the diversity of stable stacks that StackGen produces for six different input silhouettes in simulation.
We evaluate StackGen on a held-out test set of silhouettes paired with their corresponding 3D structures, and compare it against two block-stacking baselines.
In our real-world experiments, we use a UR5 robot to construct the physical structures generated by StackGen.
In this setting, the user provides an RGB image of a reference stack taken from the front view. The goal is to identify the number and type (shape) of the blocks, as well as their six-DoF poses, such that the resulting structure is stable and matches the silhouette of the reference stack. There is no explicit objective to match the color or shape of the individual blocks.
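One simple way to obtain a conditioning silhouette from such an image is to binarize the photo and keep the largest foreground region. The sketch below does this with OpenCV, assuming a front-view photo against a plain, light background; the exact preprocessing used by StackGen may differ:

import cv2
import numpy as np

def image_to_silhouette(rgb_path, out_size=(64, 64)):
    """Turn a front-view RGB photo of a reference stack into a binary
    silhouette. Assumes the stack is the dominant dark foreground object
    against a light, uniform backdrop."""
    img = cv2.imread(rgb_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding separates the blocks from the light background
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Keep the largest connected component to suppress background clutter
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n > 1:
        largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
        mask = np.where(labels == largest, 255, 0).astype(np.uint8)
    return cv2.resize(mask, out_size, interpolation=cv2.INTER_NEAREST)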
In this setting, the user provides a hand-drawn sketch of the desired structure. The task then involves determining the number and type of blocks along with their poses that result in a structure that is both stable and consistent with the reference sketch.
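Because a hand-drawn sketch typically consists of dark strokes outlining the structure, one plausible preprocessing step is to binarize the strokes and fill the outer contour to obtain a solid silhouette. The following is an illustrative conversion along those lines, again not necessarily the paper's exact pipeline:

import cv2
import numpy as np

def sketch_to_silhouette(sketch_path, out_size=(64, 64)):
    """Convert a hand-drawn outline into a filled binary silhouette:
    binarize the dark strokes, then fill the largest outer contour."""
    gray = cv2.imread(sketch_path, cv2.IMREAD_GRAYSCALE)
    _, strokes = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(strokes, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(strokes)
    if contours:
        outer = max(contours, key=cv2.contourArea)
        cv2.drawContours(mask, [outer], -1, 255, thickness=cv2.FILLED)
    return cv2.resize(mask, out_size, interpolation=cv2.INTER_NEAREST)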
@article{sun2024stackgengeneratingstablestructures,
  title={{StackGen}: {G}enerating Stable Structures from Silhouettes via Diffusion},
  author={Luzhe Sun and Takuma Yoneda and Samuel W. Wheeler and Tianchong Jiang and Matthew R. Walter},
  journal={arXiv preprint arXiv:2409.18098},
  year={2024}
}