Abstract
Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods often struggle to generalize under spatial and visual randomizations, instead tending to overfit. To address this challenge, we propose Visual-Geometry Diffusion Policy (VGDP), a multimodal imitation learning framework that fuses RGB images and point clouds through a cross-attention mechanism, fully leveraging 3D representations while incorporating the long-overlooked benefits of 2D visual features. Across a benchmark of 18 simulated tasks and 3 real-world tasks using 7 different observation encoders, VGDP outperforms other policies with an average performance improvement of 39.1%. More importantly, VGDP demonstrates strong robustness under visual and spatial perturbations, surpassing baselines by an average of 41.5% under varied visual conditions and 15.2% under varied spatial settings.
Key Attributes
Effective
VGDP is evaluated across simulation and real-world domains against 6 baselines. Simulation experiments cover LIBERO, ManiSkill, and RLBench, each with three domain randomization levels, while real-world tests span three diverse manipulation tasks. We are excited to observe that VGDP consistently outperforms all baselines by a significant margin in both simulation and real-world settings, with an average improvement of 39.1%.
Robust
We evaluate VGDP under diverse visual and spatial perturbations. Across shifts in position, lighting, texture, and camera viewpoint, VGDP maintains a consistently high and stable success rate, highlighting its strong robustness and generalization ability.
Transferable
VGDP demonstrates strong zero-shot transferability across diverse domains and tasks, even when trained on a narrow distribution. It transfers to unseen lighting conditions, object positions, and objects.
Methodology
By jointly encoding RGB images and point clouds into a shared latent space with modality-aware fusion, VGDP constructs a holistic 3D visual representation of the environment. Dropout-guided fusion proves critical for improving both accuracy and robustness.
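Below is a minimal sketch of what such a fusion block could look like, assuming a PyTorch implementation: point-cloud tokens query RGB tokens through cross-attention, and the RGB stream is randomly suppressed during training so that neither modality is over-relied upon. The class name VisualGeometryFusion, the token shapes, and the placement of the modality dropout are illustrative assumptions, not the released architecture.

import torch
import torch.nn as nn


class VisualGeometryFusion(nn.Module):
    """Sketch of a cross-attention fusion of point-cloud and RGB tokens with
    modality dropout (a hypothetical reading of 'dropout-guided fusion')."""

    def __init__(self, dim: int = 256, num_heads: int = 4, modality_drop_p: float = 0.1):
        super().__init__()
        self.modality_drop_p = modality_drop_p
        # Point-cloud tokens attend to RGB tokens: geometry queries appearance.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, pc_tokens: torch.Tensor, rgb_tokens: torch.Tensor) -> torch.Tensor:
        # pc_tokens:  (B, N_pc, dim)  point-cloud features
        # rgb_tokens: (B, N_rgb, dim) image features
        if self.training and torch.rand(()) < self.modality_drop_p:
            # Randomly drop the RGB stream so the geometric branch stays informative.
            rgb_tokens = torch.zeros_like(rgb_tokens)
        q = self.norm_q(pc_tokens)
        kv = self.norm_kv(rgb_tokens)
        fused, _ = self.cross_attn(q, kv, kv)
        x = pc_tokens + fused                  # residual fusion
        x = x + self.ffn(self.norm_ffn(x))     # feed-forward refinement
        return x                               # conditioning tokens for the diffusion head


# Usage: fuse 512 point tokens with 196 image patch tokens.
fusion = VisualGeometryFusion(dim=256)
cond = fusion(torch.randn(2, 512, 256), torch.randn(2, 196, 256))  # (2, 512, 256)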
Experiments and Results
[Real-world rollout videos: Insert Plug, comparing VGDP against Diffusion Policy (RGB), Diffusion Policy (RGBD), and DP3, with variations where the tablet is placed in a different position or replaced with AirPods; Pour Cereal, trained in a consistent scene with fixed lighting and a fixed object, then transferred to varied flowing RGB light-strip lighting and objects of different shapes and sizes; Pick Butter and Pick Bottle, showing VGDP generalizing within the training distribution and transferring to an extreme lighting shift.]
Simulation Tasks
We conduct our simulation benchmarks in RoboVerse, evaluating tasks from LIBERO, ManiSkill, and RLBench. All tasks are domain-randomized at three levels using a unified simulation codebase. [Figure: Simulation Overview]
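As an illustration only, the three randomization levels can be thought of as progressively enabling the perturbation axes discussed above (lighting, texture, camera viewpoint); the level names and fields below are assumptions, not the benchmark's actual configuration schema.

from dataclasses import dataclass


@dataclass
class RandomizationConfig:
    # Hypothetical toggles; the actual RoboVerse randomization schedule may differ.
    randomize_lighting: bool = False
    randomize_texture: bool = False
    randomize_camera_pose: bool = False


# Three increasingly aggressive levels of domain randomization (illustrative).
RANDOMIZATION_LEVELS = {
    0: RandomizationConfig(),                                   # nominal scene
    1: RandomizationConfig(randomize_lighting=True),            # lighting shifts only
    2: RandomizationConfig(randomize_lighting=True,
                           randomize_texture=True,
                           randomize_camera_pose=True),         # full visual shift
}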
BibTeX
@misc{tang2025visualgeometrydiffusionpolicyrobust,
title={Visual-Geometry Diffusion Policy: Robust Generalization via Complementarity-Aware Multimodal Fusion},
author={Yikai Tang and Haoran Geng and Sheng Zang and Pieter Abbeel and Jitendra Malik},
year={2025},
eprint={2511.22445},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.22445},
}