Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning

1 Toyota Technological Institute at Chicago (TTIC) 2 Waymo 3 Johns Hopkins University 4 Toyota Research Institute (TRI)
* Work completed while at TTIC † Equal advising

Key Findings

  • Camera poses always help: Conditioning on per-pixel Plücker ray-maps increases success rates across ACT, Diffusion Policy, and SmolVLA on all six simulated tasks and all three real-world tasks.
  • Bigger gains in the randomized setting: The benefits are larger when backgrounds vary, since policies can no longer rely on static scene cues to infer the camera pose.
  • Cropping is crucial: Applying random cropping jointly to images and Plücker ray-maps yields significant gains (a minimal sketch follows the Plücker snippet below).
  • Action space matters: The delta end-effector pose action space performs best; nonetheless, camera conditioning helps across all action spaces.
[Figure] Main results (simulated tasks)
[Figure] Effect of cropping: cropping improves success
[Figure] Action space ablation
[Figure] Fixed vs. randomized settings

Method

We explicitly condition imitation learning policies on camera geometry using per-pixel Plücker ray embeddings. Given the camera intrinsics and extrinsics, each pixel is mapped to a 6D Plücker ray (d, o × d), where d is the pixel's ray direction in world coordinates and o is the camera center; the resulting ray-map is then fused with the image features.

  • With pretrained encoders: encode the Plücker map with a small CNN to the image-latent dimension, then concatenate channel-wise with the image features.
  • Without pretrained encoders: concatenate the image with the Plücker map channel-wise at the input. (A minimal sketch of both variants follows the architecture figures below.)
[Figure] Architecture without pretraining: input concatenation
[Figure] Architecture with pretrained encoder: encode rays, then late fusion
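
A minimal sketch of the two fusion variants in PyTorch, assuming a (B, 6, H, W) ray-map aligned with the input images; the module and function names below are illustrative, not the paper's exact architecture:

import torch
import torch.nn as nn

class PluckerEncoder(nn.Module):
    """Small CNN mapping a (B, 6, H, W) Plücker ray-map to the image-latent dim."""
    def __init__(self, latent_dim, patch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, kernel_size=patch, stride=patch),  # downsample rays to the feature grid
            nn.GELU(),
            nn.Conv2d(64, latent_dim, kernel_size=1),
        )

    def forward(self, plucker):
        return self.net(plucker)

def early_fuse(image, plucker):
    """No pretrained encoder: concatenate at the input, giving (B, 3 + 6, H, W)."""
    return torch.cat([image, plucker], dim=1)

def late_fuse(image_feats, plucker, ray_encoder):
    """Pretrained encoder: encode rays to the feature resolution, then concatenate."""
    ray_feats = ray_encoder(plucker)  # (B, latent_dim, h, w), matching image_feats
    return torch.cat([image_feats, ray_feats], dim=1)

The ray encoder's patch size just needs to match the image backbone's downsampling factor so that ray_feats and image_feats share the same spatial grid before concatenation.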

Benchmarks

  • Six tasks across two simulators
  • Paired fixed and randomized camera variants of each task
[Figure] Task gallery (top: fixed, bottom: randomized): Lift, Pick Place Can, Assembly Square, Push, Lift Upright, Roll Ball

Real-World Experiments

  • UR5 robot; three movable third‑person cameras
  • Tasks: Pick Place, Plate Insertion, Hang Mug
  • Success counts across randomized camera poses
[Figure] Real-world tasks: Pick Place, Plate Insertion, Hang Mug
[Figure] Success counts (per setting)

Plücker Snippet

To add camera conditioning to your policy, you can use the following minimalist snippet to compute a Plücker ray-map from camera intrinsics and extrinsics. (It assumes the OpenCV convention, i.e., the image origin at the top-left with +z pointing forward.)

import torch

def get_plucker_raymap(K, c2w, height, width):
    """K: camera intrinsics (3, 3); c2w: camera-to-world extrinsics (4, 4).
    Returns a (height, width, 6) Plücker ray-map (direction, moment)."""
    # Pixel-center coordinates (u, v), OpenCV convention (origin at top-left).
    vv, uu = torch.meshgrid(
        torch.arange(height, device=K.device, dtype=K.dtype) + 0.5,
        torch.arange(width, device=K.device, dtype=K.dtype) + 0.5,
        indexing="ij",
    )
    # Unproject homogeneous pixels through K^-1, rotate into the world frame,
    # and normalize to unit ray directions.
    rays = torch.stack([uu, vv, torch.ones_like(uu)], dim=-1)
    d_world = torch.nn.functional.normalize(
        (rays @ torch.linalg.inv(K).T) @ c2w[:3, :3].T,
        dim=-1,
        eps=1e-9,
    )
    # Moment m = o x d, with o the camera center in world coordinates
    # (expand_as broadcasts explicitly, for compatibility with older PyTorch).
    o = c2w[:3, 3].view(1, 1, 3)
    m = torch.cross(o.expand_as(d_world), d_world, dim=-1)
    return torch.cat([d_world, m], dim=-1)
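
A quick usage sketch with hypothetical intrinsics, plus the joint random crop from the Key Findings: cropping the image and ray-map at the same offsets keeps every remaining pixel paired with its correct world-space ray. joint_random_crop is an illustrative helper, not part of the released codebase:

# Hypothetical 640x480 pinhole camera with an identity camera-to-world pose.
K = torch.tensor([[500.0,   0.0, 320.0],
                  [  0.0, 500.0, 240.0],
                  [  0.0,   0.0,   1.0]])
c2w = torch.eye(4)
raymap = get_plucker_raymap(K, c2w, 480, 640).permute(2, 0, 1)  # (6, 480, 640)

def joint_random_crop(image, raymap, crop_h, crop_w):
    """Crop image and ray-map at the same location so rays stay pixel-aligned."""
    _, H, W = image.shape
    top = torch.randint(0, H - crop_h + 1, (1,)).item()
    left = torch.randint(0, W - crop_w + 1, (1,)).item()
    return (image[:, top:top + crop_h, left:left + crop_w],
            raymap[:, top:top + crop_h, left:left + crop_w])

image = torch.rand(3, 480, 640)  # stand-in for a camera frame
image_c, raymap_c = joint_random_crop(image, raymap, 448, 576)

Note that a crop changes the effective principal point of the image; because the ray-map is cropped with the same offsets, the geometry the policy sees remains consistent.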

BibTeX

@article{jiang2025knowyourcamera,
  title     = {Do You Know Where Your Camera Is? {V}iew-Invariant Policy Learning with Camera Conditioning},
  author    = {Tianchong Jiang and Jingtian Ji and Xiangshan Tan and Jiading Fang and Anand Bhattad and Vitor Guizilini and Matthew R. Walter},
  journal   = {arXiv preprint arXiv:2510.02268},
  year      = {2025},
}