Do You Know Where Your Camera Is?

View-Invariant Policy Learning with Camera Conditioning

Tianchong Jiang1 Jingtian Ji1 Xiangshan Tan1 Jiading Fang2,* Anand Bhattad3 Vitor Guizilini4,† Matthew R. Walter1,†
1 Toyota Technological Institute at Chicago (TTIC) 2 Waymo 3 Johns Hopkins University 4 Toyota Research Institute (TRI)
* Work completed while at TTIC † Equal advising
Teaser figures: real-world camera poses and the simulation front view.

View-invariant policies via explicit camera conditioning.

Key Findings

  • Plücker is always a gain: Conditioning with per-pixel Plücker ray-maps increases success across ACT, Diffusion Policy, and SmolVLA on all six tasks.
  • Bigger gains in the randomized setting: The gains are larger when backgrounds vary, because policies can no longer rely on static scene cues.
  • Cropping is crucial: Applying random cropping jointly to images and Plücker maps yields consistent boosts, effectively acting as “virtual cameras” (see the sketch after this list).
  • Action space matters: Delta end-effector pose performs best; nonetheless, camera conditioning helps across all action spaces.
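
As an illustration of the joint-cropping idea above, here is a minimal sketch (not the released training code; the function name and tensor layout are assumptions): the same crop window is applied to the image and its Plücker map so that every remaining pixel keeps its original ray.

import torch

def random_joint_crop(image, plucker, crop_h, crop_w):
    """Apply one random crop window to both the image and its Plucker ray map.

    image:   (3, H, W) float tensor
    plucker: (6, H, W) float tensor of per-pixel ray directions and moments
    """
    _, h, w = image.shape
    top = torch.randint(0, h - crop_h + 1, (1,)).item()
    left = torch.randint(0, w - crop_w + 1, (1,)).item()
    rows, cols = slice(top, top + crop_h), slice(left, left + crop_w)
    return image[:, rows, cols], plucker[:, rows, cols]

Because each pixel keeps its own ray, the cropped pair remains geometrically consistent, which is why the crop behaves like a different camera rather than a purely 2D augmentation.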
Result figures: main results on the simulated tasks; effect of cropping on success; action-space ablation; fixed vs. randomized views.

Method

We explicitly condition visuomotor policies on camera geometry using per-pixel Plücker ray embeddings. Given the camera intrinsics and extrinsics, each pixel is mapped to a 6D ray representation (direction and moment) that is concatenated with the image features. Fusion depends on whether the image encoder is pretrained (a sketch of the ray-map construction follows the list below):

  • With pretrained encoders: encode the Plücker map with a small CNN to the image-latent dimension, then late-fuse via channel-wise concatenation.
  • Without pretrained encoders: concatenate the Plücker map with the image at input and learn end-to-end.
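
The following is a rough sketch of how such a per-pixel Plücker map can be built from the intrinsics and extrinsics (our reconstruction of the standard formulation, not the authors' released code; the function name and conventions are assumptions): each pixel is back-projected, rotated into the world frame, and paired with its moment m = o × d about the camera center o.

import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Per-pixel 6D Plucker embedding (unit direction, moment) for a pinhole camera.

    K:            (3, 3) intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsics
    Returns a (6, H, W) array that can be concatenated with the image.
    """
    R, o = cam_to_world[:3, :3], cam_to_world[:3, 3]                # rotation and camera center
    u, v = np.meshgrid(np.arange(width), np.arange(height))         # pixel grid, shape (H, W)
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, float)], -1)  # homogeneous pixel centers
    dirs = pix @ np.linalg.inv(K).T @ R.T                           # back-project, rotate to world
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)            # unit directions d
    moments = np.cross(o, dirs)                                     # moments m = o x d
    return np.concatenate([dirs, moments], -1).transpose(2, 0, 1)   # (6, H, W)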
Architecture figures: input concatenation (no pretraining) and ray encoding with late fusion (pretrained encoder).
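
A rough PyTorch sketch of these two fusion variants (module structure, channel counts, and the small ray CNN are illustrative assumptions, not the paper's exact architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusion(nn.Module):
    """No pretrained encoder: concatenate image (3 ch) and Plucker map (6 ch) at the input."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # a network whose first layer accepts 9 input channels

    def forward(self, image, plucker):
        return self.backbone(torch.cat([image, plucker], dim=1))

class LateFusion(nn.Module):
    """Pretrained image encoder: encode rays with a small CNN, then concatenate channel-wise."""
    def __init__(self, image_encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.image_encoder = image_encoder                 # pretrained backbone, kept as-is
        self.ray_encoder = nn.Sequential(                  # small CNN to the image-latent width
            nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
        )

    def forward(self, image, plucker):
        img_feat = self.image_encoder(image)               # (B, feat_dim, h, w) feature map
        ray_feat = self.ray_encoder(plucker)
        ray_feat = F.interpolate(ray_feat, size=img_feat.shape[-2:])  # match spatial size
        return torch.cat([img_feat, ray_feat], dim=1)      # channel-wise late fusion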

Benchmarks

  • Six tasks across two simulators
  • Paired fixed vs. randomized variants of each task
Task previews (top row: fixed, bottom row: randomized): Lift, Pick Place Can, Assembly Square, Push, Lift Upright, Roll Ball.

Evaluation Protocol

We follow a standardized evaluation to ensure fair, stable estimates:

  • Train with three random seeds; evaluate the last ten checkpoints per run.
  • For each checkpoint, run 50 rollouts on distinct initial-state × camera-pose pairs (1,500 rollouts per setting for ACT/DP; 500 for SmolVLA).
  • Use fixed environment seeds so that all methods face identical stochasticity and initializations (see the sketch below).
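
To make the pairing concrete, here is a hypothetical sketch of the loop for an ACT/DP run (last_n_checkpoints, load_checkpoint, make_env, and rollout are placeholder helpers, not a released API); seeding the environment by episode index is what keeps the initial-state and camera-pose draws identical across methods.

# Hypothetical evaluation loop; the helper functions are placeholders.
NUM_SEEDS, NUM_CKPTS, NUM_ROLLOUTS = 3, 10, 50                    # 3 * 10 * 50 = 1,500 rollouts

successes = 0
for train_seed in range(NUM_SEEDS):
    for ckpt in last_n_checkpoints(train_seed, n=NUM_CKPTS):      # last ten checkpoints per run
        policy = load_checkpoint(ckpt)
        for episode in range(NUM_ROLLOUTS):
            # Fixed environment seed per episode index: every method and checkpoint
            # sees the same initial state and camera pose for a given episode.
            env = make_env(seed=episode, randomize_camera=True)
            successes += rollout(policy, env)                     # 1 on success, else 0

success_rate = successes / (NUM_SEEDS * NUM_CKPTS * NUM_ROLLOUTS)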

Real-World Experiments

  • UR5 robot; three movable third‑person cameras
  • Tasks: Pick Place, Plate Insertion, Hang Mug
  • Success counts across randomized camera poses
Figures: Pick Place, Plate Insertion, and Hang Mug tasks; success counts per setting.

Resources

  • Paper (PDF)
  • Benchmarks, demonstrations, and code: coming soon.

BibTeX

@article{jiang2025knowyourcamera,
  title     = {Do You Know Where Your Camera Is? {V}iew-Invariant Policy Learning with Camera Conditioning},
  author    = {Tianchong Jiang and Jingtian Ji and Xiangshan Tan and Jiading Fang and Anand Bhattad and Vitor Guizilini and Matthew R. Walter},
  journal   = {arXiv preprint arXiv:2510.02268},
  year      = {2025},
}