Do You Know Where Your Camera Is?

View-Invariant Policy Learning with Camera Conditioning

Tianchong Jiang1 Jingtian Ji1 Xiangshan Tan1 Jiading Fang2,* Anand Bhattad3 Vitor Guizilini4,† Matthew R. Walter1,†
1 Toyota Technological Institute at Chicago (TTIC) 2 Waymo 3 Johns Hopkins University 4 Toyota Research Institute (TRI)
* Work completed while at TTIC † Equal advising
Teaser figures: real-world camera poses and the simulation front view.

View-invariant policies via explicit camera conditioning.

Key Findings

  • Plücker is always a gain: Conditioning with per-pixel Plücker ray-maps increases success across ACT, Diffusion Policy, and SmolVLA on all six tasks.
  • Bigger gains in the randomized setting: The gains are larger when backgrounds vary, because policies can no longer rely on static scene cues.
  • Cropping is crucial: Applying random cropping jointly to images and Plücker maps yields consistent boosts, effectively acting as “virtual cameras” (see the sketch after this list).
  • Action space matters: Delta end-effector pose performs best; nonetheless, camera conditioning helps across all action spaces.
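
As an illustration of the joint-cropping idea above, here is a minimal sketch (not the released training code; the function name and tensor layout are assumptions): the same crop window is applied to the image and its Plücker map so that every remaining pixel keeps its original ray.

import torch

def random_joint_crop(image, plucker, crop_h, crop_w):
    """Apply one random crop window to both the image and its Plucker ray map.

    image:   (3, H, W) float tensor
    plucker: (6, H, W) float tensor of per-pixel ray directions and moments
    """
    _, h, w = image.shape
    top = torch.randint(0, h - crop_h + 1, (1,)).item()
    left = torch.randint(0, w - crop_w + 1, (1,)).item()
    rows, cols = slice(top, top + crop_h), slice(left, left + crop_w)
    return image[:, rows, cols], plucker[:, rows, cols]

Because each pixel keeps its own ray, the cropped pair remains geometrically consistent, which is why the crop behaves like a different camera rather than a purely 2D augmentation.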
Result figures: main results on the simulated tasks; effect of cropping on success; action-space ablation; fixed vs. randomized views.

Method

We explicitly condition visuomotor policies on camera geometry using per-pixel Plücker ray embeddings. Given the camera intrinsics and extrinsics, each pixel is mapped to a 6D ray representation (direction and moment) that is concatenated with the image features. Fusion depends on whether the image encoder is pretrained (a sketch of the ray-map construction follows the list below):

  • With pretrained encoders: encode the Plücker map with a small CNN to the image-latent dimension, then late-fuse via channel-wise concatenation.
  • Without pretrained encoders: concatenate the Plücker map with the image at input and learn end-to-end.
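
The following is a rough sketch of how such a per-pixel Plücker map can be built from the intrinsics and extrinsics (our reconstruction of the standard formulation, not the authors' released code; the function name and conventions are assumptions): each pixel is back-projected, rotated into the world frame, and paired with its moment m = o × d about the camera center o.

import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Per-pixel 6D Plucker embedding (unit direction, moment) for a pinhole camera.

    K:            (3, 3) intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsics
    Returns a (6, H, W) array that can be concatenated with the image.
    """
    R, o = cam_to_world[:3, :3], cam_to_world[:3, 3]                # rotation and camera center
    u, v = np.meshgrid(np.arange(width), np.arange(height))         # pixel grid, shape (H, W)
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, float)], -1)  # homogeneous pixel centers
    dirs = pix @ np.linalg.inv(K).T @ R.T                           # back-project, rotate to world
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)            # unit directions d
    moments = np.cross(o, dirs)                                     # moments m = o x d
    return np.concatenate([dirs, moments], -1).transpose(2, 0, 1)   # (6, H, W)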
Architecture figures: input concatenation (no pretraining) and ray encoding with late fusion (pretrained encoder).
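
A rough PyTorch sketch of these two fusion variants (module structure, channel counts, and the small ray CNN are illustrative assumptions, not the paper's exact architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusion(nn.Module):
    """No pretrained encoder: concatenate image (3 ch) and Plucker map (6 ch) at the input."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # a network whose first layer accepts 9 input channels

    def forward(self, image, plucker):
        return self.backbone(torch.cat([image, plucker], dim=1))

class LateFusion(nn.Module):
    """Pretrained image encoder: encode rays with a small CNN, then concatenate channel-wise."""
    def __init__(self, image_encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.image_encoder = image_encoder                 # pretrained backbone, kept as-is
        self.ray_encoder = nn.Sequential(                  # small CNN to the image-latent width
            nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
        )

    def forward(self, image, plucker):
        img_feat = self.image_encoder(image)               # (B, feat_dim, h, w) feature map
        ray_feat = self.ray_encoder(plucker)
        ray_feat = F.interpolate(ray_feat, size=img_feat.shape[-2:])  # match spatial size
        return torch.cat([img_feat, ray_feat], dim=1)      # channel-wise late fusion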

Benchmarks

  • Six tasks across two simulators
  • Paired fixed vs. randomized variants of each task
Task previews (top row: fixed, bottom row: randomized): Lift, Pick Place Can, Assembly Square, Push, Lift Upright, Roll Ball.

Evaluation Protocol

We follow a standardized evaluation to ensure fair, stable estimates:

  • Train with three random seeds; evaluate the last ten checkpoints per run.
  • For each checkpoint, run 50 rollouts on distinct initial-state × camera-pose pairs (1,500 rollouts per setting for ACT/DP; 500 for SmolVLA).
  • Use fixed environment seeds so that all methods face identical stochasticity and initializations (see the sketch below).
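
To make the pairing concrete, here is a hypothetical sketch of the loop for an ACT/DP run (last_n_checkpoints, load_checkpoint, make_env, and rollout are placeholder helpers, not a released API); seeding the environment by episode index is what keeps the initial-state and camera-pose draws identical across methods.

# Hypothetical evaluation loop; the helper functions are placeholders.
NUM_SEEDS, NUM_CKPTS, NUM_ROLLOUTS = 3, 10, 50                    # 3 * 10 * 50 = 1,500 rollouts

successes = 0
for train_seed in range(NUM_SEEDS):
    for ckpt in last_n_checkpoints(train_seed, n=NUM_CKPTS):      # last ten checkpoints per run
        policy = load_checkpoint(ckpt)
        for episode in range(NUM_ROLLOUTS):
            # Fixed environment seed per episode index: every method and checkpoint
            # sees the same initial state and camera pose for a given episode.
            env = make_env(seed=episode, randomize_camera=True)
            successes += rollout(policy, env)                     # 1 on success, else 0

success_rate = successes / (NUM_SEEDS * NUM_CKPTS * NUM_ROLLOUTS)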

Real-World Experiments

  • UR5 robot; three movable third‑person cameras
  • Tasks: Pick Place, Plate Insertion, Hang Mug
  • Success counts across randomized camera poses
Figures: Pick Place, Plate Insertion, and Hang Mug tasks; success counts per setting.

Resources

  • Paper (PDF)
  • Benchmarks, demonstrations, and code: coming soon.

BibTeX

@article{jiang2025knowyourcamera,
  title     = {Do You Know Where Your Camera Is? {V}iew-Invariant Policy Learning with Camera Conditioning},
  author    = {Tianchong Jiang and Jingtian Ji and Xiangshan Tan and Jiading Fang and Anand Bhattad and Vitor Guizilini and Matthew R. Walter},
  journal   = {arXiv preprint arXiv:2510.02268},
  year      = {2025},
}