ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning
By: Wendi Chen, Han Xue, Yi Wang, and more
Potential Business Impact:
Robots learn to touch and move objects precisely.
Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.
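The core of Structural Slow-Fast Learning is processing visual and force tokens that arrive at different rates within one causally masked attention stream. As a rough illustration of that idea (not the paper's actual implementation — the function names, token rates, and tie-breaking rule below are all assumptions), one can interleave slow visual tokens and fast force tokens by timestamp and build a causal mask so each token only attends to tokens that occurred at the same time or earlier:

```python
# Hypothetical sketch of asynchronous slow-fast token interleaving.
# Visual tokens arrive at a low rate, force tokens at a high rate;
# a causal mask restricts attention to past-or-present tokens.
# All names and rates are illustrative, not from the paper.

def interleave_tokens(vision_hz, force_hz, horizon_s):
    """Merge slow visual and fast force token timestamps into one
    chronologically ordered stream, tagging each token's modality."""
    tokens = [(i / vision_hz, "vision") for i in range(int(horizon_s * vision_hz))]
    tokens += [(i / force_hz, "force") for i in range(int(horizon_s * force_hz))]
    # On timestamp ties, place vision before force so the fast tokens
    # can condition on the co-occurring visual context (an assumption).
    order = {"vision": 0, "force": 1}
    tokens.sort(key=lambda t: (t[0], order[t[1]]))
    return tokens

def causal_mask(tokens):
    """mask[i][j] is True iff token i may attend to token j,
    i.e. j appears at the same position or earlier in the stream."""
    n = len(tokens)
    return [[j <= i for j in range(n)] for i in range(n)]

# Example: 10 Hz vision, 100 Hz force, over a 0.1 s window
# yields 1 visual token and 10 force tokens in one ordered stream.
stream = interleave_tokens(10, 100, 0.1)
mask = causal_mask(stream)
```

Under this masking, every high-frequency force token can still attend to the latest visual token, which is what lets the policy react at the force rate while the visual context updates slowly.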
Similar Papers
Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation
Robotics
Robots learn to touch and react like humans.
3D Flow Diffusion Policy: Visuomotor Policy Learning via Generating Flow in 3D Space
Robotics
Robots learn to grab and move things better.
Unified Multimodal Diffusion Forcing for Forceful Manipulation
Robotics
Teaches robots to learn from seeing, doing, and feeling.