VGDP
Visual-Geometry Diffusion Policy

1 University of California, Berkeley
2 Nanyang Technological University

*Indicates Equal Contribution, †Indicates Equal Advising

Abstract

Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods often struggle to generalize under spatial and visual randomizations, instead tending to overfit. To address this challenge, we propose Visual-Geometry Diffusion Policy (VGDP), a multimodal imitation learning framework that fuses RGB images and point clouds through a cross-attention mechanism, fully leveraging 3D representations while incorporating the long-overlooked benefits of 2D visual features. Across a benchmark of 18 simulated tasks and 3 real-world tasks using 7 different observation encoders, VGDP outperforms other policies with an average performance improvement of 39.1%. More importantly, VGDP demonstrates strong robustness under visual and spatial perturbations, surpassing baselines with an average improvement of 41.5% across different visual conditions and 15.2% across different spatial settings.

Key Attributes

Effective

VGDP is evaluated across simulation and real-world domains against 6 baselines. Simulation experiments cover LIBERO, ManiSkill, and RLBench, each with three domain randomization levels, while real-world tests span three diverse manipulation tasks. We are excited to observe that VGDP consistently outperforms all baselines by a significant margin in both simulation and real-world settings, with an average improvement of 39.1%.

Effectiveness Visualization

Robust

We evaluate VGDP under diverse visual and spatial perturbations. Across shifts in position, lighting, texture, and camera viewpoint, VGDP maintains a consistently high and stable success rate, highlighting its strong robustness and generalization ability.

Robust Visualization

Transferable

VGDP demonstrates strong zero-shot transferability across diverse domains and tasks, even when trained within a narrow distribution. It transfers to unseen lighting, positions, and objects.

Transferable Visualization

Methodology

By jointly encoding RGB images and point clouds into a shared latent space with modality-aware fusion, VGDP constructs a holistic 3D visual representation of the environment. Dropout-guided fusion proves critical for improving both accuracy and robustness.
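
As a rough illustration of this design, the sketch below shows one way a cross-attention fusion block over RGB and point-cloud tokens with modality dropout could be written in PyTorch. The token dimension, the use of nn.MultiheadAttention, and the dropout rate are assumptions made for illustration, not the released VGDP implementation.

# Minimal sketch (assumed shapes and modules, not the official code):
# point-cloud tokens attend to RGB tokens and vice versa, and during training
# one modality is occasionally zeroed out so the policy cannot rely on it alone.
import torch
import torch.nn as nn

class VisualGeometryFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, modality_dropout=0.1):
        super().__init__()
        self.pc_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.rgb_from_pc = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_pc = nn.LayerNorm(dim)
        self.norm_rgb = nn.LayerNorm(dim)
        self.modality_dropout = modality_dropout

    def forward(self, rgb_tokens, pc_tokens):
        # rgb_tokens: (B, N_rgb, dim) image features; pc_tokens: (B, N_pc, dim) point features
        if self.training and torch.rand(()) < self.modality_dropout:
            # Dropout-guided fusion: occasionally suppress one whole modality.
            if torch.rand(()) < 0.5:
                rgb_tokens = torch.zeros_like(rgb_tokens)
            else:
                pc_tokens = torch.zeros_like(pc_tokens)
        pc_fused, _ = self.pc_from_rgb(pc_tokens, rgb_tokens, rgb_tokens)
        rgb_fused, _ = self.rgb_from_pc(rgb_tokens, pc_tokens, pc_tokens)
        pc_tokens = self.norm_pc(pc_tokens + pc_fused)
        rgb_tokens = self.norm_rgb(rgb_tokens + rgb_fused)
        # Shared latent for the diffusion head: concatenated fused token sets.
        return torch.cat([rgb_tokens, pc_tokens], dim=1)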

Methodology Visualization

Experiments and Results

Insert Plug

In the Insert Plug task, the robot must insert a plug into a socket. The position of the plug is randomized evenly over 100 positions within a 20 cm × 20 cm square.
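
As a small, hedged example of this placement protocol, the snippet below generates 100 evenly spaced positions over a 20 cm × 20 cm square; the 10 × 10 grid layout is an assumption, since the text only states that 100 positions cover the square evenly.

# Hypothetical reproduction of the plug-placement grid (10 x 10 layout is assumed).
import numpy as np

def plug_positions(side_m=0.20, n_per_axis=10):
    xs = np.linspace(0.0, side_m, n_per_axis)
    ys = np.linspace(0.0, side_m, n_per_axis)
    # Cartesian product -> (100, 2) array of (x, y) offsets in meters.
    return np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

assert plug_positions().shape == (100, 2)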

VGDP

Diffusion Policy (RGB)

Diffusion Policy (RGBD)

DP3

During evaluation, we also test the policy with the tablet placed at different positions and with the tablet replaced by a pair of AirPods.

Tablet in Different Position

Replaced with AirPods

Pour Cereal

In the Pour Cereal task, the robot must pour cereal from a bowl onto a plate. We train the policy in a fixed environment and test both its in-distribution performance and its zero-shot transfer to unseen objects and lighting conditions.

Training: a consistent scene with fixed lighting and object

Transfer: varying illumination from a flowing RGB light strip and objects of different shapes and sizes

Pick Butter

The Pick Butter task mirrors the corresponding task in LIBERO-Object, where the robot must pick the butter from a cluttered table and place it into the basket.

VGDP

Beyond confirming VGDP's effectiveness, we also find that the real-world results align surprisingly well with those of the same task in simulation. This suggests that strongly domain-randomized simulation benchmarks can reflect real-world performance well.
Simulation vs Real-world Comparison

Performance comparison between simulation and real-world settings

Demo Rollout in Simulation

Mimic Task in Real-world

Pick Bottle

The Pick Bottle task evaluates spatial generalization in a precision-critical setting. The robot must reach the bottle cap with sub-centimeter accuracy (≤1 cm) in order to successfully grasp it. During training, the bottle is placed at 30 distinct locations, and performance is assessed across all 143 grid positions in the workspace.
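
The sketch below illustrates how such a grid sweep could be scored; the 11 × 13 layout and 2 cm spacing are assumptions, since only the total of 143 grid positions is stated, and run_episode is a hypothetical rollout helper.

# Illustrative evaluation sweep over an assumed 11 x 13 workspace grid (143 cells).
import itertools

N_X, N_Y, SPACING_M = 11, 13, 0.02               # assumed grid shape and spacing
GRID = [(i * SPACING_M, j * SPACING_M)           # (x, y) offsets in meters
        for i, j in itertools.product(range(N_X), range(N_Y))]
assert len(GRID) == 143

def grid_success_rate(policy, run_episode):
    # run_episode(policy, position) -> bool is a hypothetical rollout helper.
    return sum(run_episode(policy, pos) for pos in GRID) / len(GRID)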

Generalization within the training distribution

Transfer to extreme lighting shift

Simulation Tasks

We conduct our simulation benchmarks in RoboVerse, evaluating tasks from LIBERO, ManiSkill, and RLBench. All tasks are domain-randomized at three levels using a unified simulation codebase.
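
For intuition, the sketch below shows what a three-level randomization schedule could look like as a configuration; the specific factors and ranges are assumptions chosen to mirror the perturbations discussed above (lighting, texture, camera viewpoint), not the benchmark's actual settings.

# Hedged sketch of three domain-randomization levels (factor names and ranges are assumed).
RANDOMIZATION_LEVELS = {
    "level_1": {"random_texture": False, "lighting_jitter": 0.0, "camera_jitter_deg": 0.0},
    "level_2": {"random_texture": True,  "lighting_jitter": 0.3, "camera_jitter_deg": 5.0},
    "level_3": {"random_texture": True,  "lighting_jitter": 0.6, "camera_jitter_deg": 15.0},
}

def apply_randomization(scene, level):
    # scene.randomize(...) is a hypothetical interface, not a RoboVerse API.
    cfg = RANDOMIZATION_LEVELS[level]
    scene.randomize(texture=cfg["random_texture"],
                    lighting=cfg["lighting_jitter"],
                    camera_jitter_deg=cfg["camera_jitter_deg"])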

Simulation Overview

BibTeX

@misc{tang2025visualgeometrydiffusionpolicyrobust,
      title={Visual-Geometry Diffusion Policy: Robust Generalization via Complementarity-Aware Multimodal Fusion},
      author={Yikai Tang and Haoran Geng and Sheng Zang and Pieter Abbeel and Jitendra Malik},
      year={2025},
      eprint={2511.22445},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.22445},
}