StackGen: Generating Stable Structures from Silhouettes via Diffusion

¹Toyota Technological Institute at Chicago, ²Argonne National Laboratory. *Equal contribution

StackGen takes as input a user's hand-drawn sketch and identifies the different block shapes and their poses that together form a stable structure consistent with the sketch. A UR5 arm then builds the physical structure.

Abstract

Humans naturally obtain intuition about the interactions between and the stability of rigid objects by observing and interacting with the world. It is this intuition that governs the way in which we regularly configure objects in our environment, allowing us to build complex structures from simple, everyday objects. Robotic agents, on the other hand, traditionally require an explicit model of the world that includes the detailed geometry of each object and an analytical model of the environment dynamics, which are difficult to scale and preclude generalization. Instead, robots would benefit from an awareness of intuitive physics that enables them to similarly reason over the stable interaction of objects in their environment.

Towards that goal, we propose StackGen, a diffusion model that generates diverse stable configurations of building blocks matching a target silhouette. To demonstrate the capability of the method, we evaluate it in a simulated environment and deploy it in a real-world setting, using a robotic arm to assemble structures generated by the model.

Model Architecture

Transformer-based Diffusion Model

StackGen predicts the set of block shapes along with their six-DoF poses sufficient to collectively realize a stable 3D structure consistent with a user-provided silhouette. StackGen uses a transformer-based conditional diffusion model trained on diverse examples of stable 3D structures paired with different forms of conditioning.
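
As a concrete illustration, the snippet below sketches what a denoising network of this form could look like in PyTorch. The module names, token layout, silhouette encoder, and DDPM-style noise-prediction target are assumptions made for the sketch, not the paper's exact architecture; block shape types are treated as given here, whereas the full model also predicts them.

```python
# Hypothetical sketch of a transformer-based conditional denoiser in the
# spirit of StackGen. Layer sizes and the noise-prediction parameterization
# are illustrative assumptions.
import torch
import torch.nn as nn


class BlockDenoiser(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=6, n_shape_types=8):
        super().__init__()
        # Each block token encodes a noisy 6-DoF pose (3D position +
        # axis-angle rotation) plus an embedding of its shape type.
        self.pose_in = nn.Linear(6, d_model)
        self.shape_emb = nn.Embedding(n_shape_types, d_model)
        self.time_emb = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                      nn.Linear(d_model, d_model))
        # The conditioning silhouette is encoded into a sequence of patch tokens.
        self.silhouette_enc = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=4), nn.SiLU(),
            nn.Conv2d(32, d_model, kernel_size=4, stride=4))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.pose_out = nn.Linear(d_model, 6)  # predicted noise on each pose

    def forward(self, noisy_pose, shape_ids, silhouette, t):
        # noisy_pose: (B, N, 6), shape_ids: (B, N),
        # silhouette: (B, 1, H, W), t: (B,) diffusion timestep
        block_tok = self.pose_in(noisy_pose) + self.shape_emb(shape_ids)
        block_tok = block_tok + self.time_emb(t[:, None].float())[:, None, :]
        sil_tok = self.silhouette_enc(silhouette).flatten(2).transpose(1, 2)
        tokens = torch.cat([sil_tok, block_tok], dim=1)
        out = self.transformer(tokens)
        # Only the block tokens carry denoising predictions.
        return self.pose_out(out[:, sil_tok.shape[1]:])
```

At inference time, standard diffusion sampling (e.g., DDPM or DDIM) would start from Gaussian noise over the poses and iteratively apply such a denoiser, conditioned on the target silhouette.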


Construction by Deconstruction

StackGen uses (below-left) a "construction by deconstruction" strategy to generate a diverse set of stable structures. The procedure first builds an initial structure out of a dense collection of different blocks. While the structure remains stable, we iteratively remove a randomly chosen block, adding each stable intermediate structure to the dataset. This results in (below-right) a diverse collection of stable stacks.
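
The data-generation loop can be summarized with the following sketch. The simulator-facing helpers (`build_dense_structure`, `is_stable`, `without`) are hypothetical stand-ins for the physics-engine interface, and each stability check corresponds to a short simulated rollout.

```python
import random


def construction_by_deconstruction(simulator, num_seeds, seed=0):
    """Collect stable structures by repeatedly removing blocks from a dense
    initial stack, keeping every intermediate stack the simulator judges
    stable. Helper names are illustrative, not taken from the paper's code."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_seeds):
        structure = simulator.build_dense_structure()  # dense initial structure
        while structure.blocks and simulator.is_stable(structure):
            dataset.append(structure.copy())           # record the stable stack
            block = rng.choice(structure.blocks)       # pick a block at random
            structure = structure.without(block)       # remove it and re-check
    return dataset
```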


Experiments

Qualitative Results

The following is a visualization of the diversity of stable stacks that StackGen produces for six different input silhouettes in simulation.


Quantitative Comparison

We evaluate StackGen on a held-out test set of silhouettes paired with their 3D structures, and compare with two block stacking baselines:

  • Brute-Force: The algorithm takes as input the predicted list of shapes for the given silhouette. For each block in the list, it searches for the position that best matches the silhouette while avoiding collisions with other blocks.
  • Greedy-Random: The algorithm takes as input the predicted list of shapes for the given silhouette. It greedily places the largest remaining block in the first position that matches the silhouette, working from left to right and bottom to top (a simplified sketch of this procedure follows the list).
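
A minimal, self-contained sketch of the Greedy-Random placement loop is given below. It makes simplifying assumptions for illustration: blocks are reduced to axis-aligned rectangular footprints on a 2D silhouette grid, and the `Block`, `fits`, and `greedy_random_place` names are not taken from the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class Block:
    width: int   # footprint width, in silhouette grid cells
    height: int  # footprint height, in silhouette grid cells


def fits(block, row, col, silhouette, occupied):
    """True if the block's footprint lies entirely on silhouette cells that
    are not already covered by a previously placed block."""
    for r in range(row, row + block.height):
        for c in range(col, col + block.width):
            if r >= len(silhouette) or c >= len(silhouette[0]):
                return False
            if not silhouette[r][c] or occupied[r][c]:
                return False
    return True


def greedy_random_place(blocks, silhouette):
    """Place the largest remaining block at the first position consistent
    with the silhouette, scanning bottom-to-top (row 0 is the bottom row)
    and left-to-right within each row."""
    occupied = [[False] * len(silhouette[0]) for _ in silhouette]
    placements = []
    for block in sorted(blocks, key=lambda b: b.width * b.height, reverse=True):
        for row in range(len(silhouette)):
            cols = [c for c in range(len(silhouette[0]))
                    if fits(block, row, c, silhouette, occupied)]
            if cols:
                col = cols[0]                       # leftmost valid position
                for r in range(row, row + block.height):
                    for c in range(col, col + block.width):
                        occupied[r][c] = True
                placements.append((block, row, col))
                break                               # move on to the next block
    return placements
```
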
We use a physics simulator to test the stability of each of the resulting structures and measure their consistency with the ground-truth target stack in terms of their orthographic projections (i.e., front, top, and side views). The results reveal that StackGen generates structures that are more stable and that better match the input specification.
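
One way to instantiate the projection-consistency measure is sketched below, assuming both structures are rasterized into boolean occupancy grids; the use of IoU over the three projections is an assumption for illustration, not necessarily the exact metric reported in the paper.

```python
import numpy as np


def projection_consistency(pred_voxels, gt_voxels):
    """Mean IoU between the orthographic projections (front, side, top) of
    two boolean occupancy grids of shape (X, Y, Z)."""
    scores = []
    for axis in range(3):                      # project along each axis in turn
        pred_proj = pred_voxels.any(axis=axis)
        gt_proj = gt_voxels.any(axis=axis)
        union = np.logical_or(pred_proj, gt_proj).sum()
        inter = np.logical_and(pred_proj, gt_proj).sum()
        scores.append(inter / union if union else 1.0)
    return float(np.mean(scores))
```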


Real-World Experiments

In our real-world experiments, we use a UR5 robot to construct the physical structure identified by StackGen when provided with different specifications of the desired structure.

Stack→Stack

In this setting, the user provides an RGB image of a reference stack taken from the front view. The goal is to identify the number and type (shape) of blocks, as well as their six-DoF poses, such that the resulting structure is stable and matches the silhouette of the reference stack. There is no explicit objective to match the color or the exact shape of the individual blocks in the reference; only the overall silhouette must be reproduced.
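
A natural pre-processing step for this setting is to reduce the reference image to a binary front-view silhouette before conditioning the model. The background-subtraction approach below is a hypothetical stand-in for whatever segmentation the actual pipeline uses.

```python
import numpy as np


def silhouette_from_rgb(image, background, threshold=30):
    """Extract a boolean (H, W) silhouette mask from a reference RGB image
    (uint8, shape (H, W, 3)) by simple background subtraction. Illustrative
    only; a learned segmentation model could be used instead."""
    diff = np.abs(image.astype(np.int16) - background.astype(np.int16))
    return diff.sum(axis=-1) > threshold
```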


Sketch→Stack

In this setting, the user provides a hand-drawn sketch of the desired structure. The task then involves determining the number and type of blocks along with their poses that result in a structure that is both stable and consistent with the reference sketch.


BibTeX

@article{sun2024stackgengeneratingstablestructures,
  title={{StackGen}: {G}enerating Stable Structures from Silhouettes via Diffusion},
  author={Luzhe Sun and Takuma Yoneda and Samuel W. Wheeler and Tianchong Jiang and Matthew R. Walter},
  journal={arXiv preprint arXiv:2409.18098},
  year={2024}
}