Score: 1

ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Published: June 19, 2025 | arXiv ID: 2506.15953v1

By: Liang Heng, Haoran Geng, Kaifeng Zhang, and more

BigTech Affiliations: University of California, Berkeley

Potential Business Impact:

Robots that combine vision and touch can grasp and manipulate objects more precisely.

Business Areas:
Autonomous Vehicles, Transportation

Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.
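To make the architecture described above more concrete, here is a minimal sketch of a cross-attention fusion of visual and tactile tokens paired with an autoregressive head that predicts future contact signals. This is not the authors' implementation: the module names, token dimensions, number of layers, and the single-direction attention (tactile attending to vision) are all illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's code): tactile tokens
# cross-attend to visual tokens, and a causal transformer head predicts
# the next tactile signal from the fused representation.
import torch
import torch.nn as nn


class VisuoTactileFusion(nn.Module):
    def __init__(self, dim=256, heads=8, tactile_dim=64):
        super().__init__()
        # Cross-attention encoder: tactile queries attend over visual keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Autoregressive tactile prediction head: causal self-attention over the
        # fused tokens, then a projection back to the raw tactile signal space.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.ar_decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tactile_head = nn.Linear(dim, tactile_dim)

    def forward(self, visual_tokens, tactile_tokens):
        # visual_tokens:  (B, Nv, dim), e.g. patch features from a vision backbone
        # tactile_tokens: (B, Nt, dim), e.g. embedded high-resolution touch history
        fused, _ = self.cross_attn(tactile_tokens, visual_tokens, visual_tokens)
        fused = self.norm(fused + tactile_tokens)
        # Causal mask so each step only sees earlier contact history.
        T = fused.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        hidden = self.ar_decoder(fused, mask=mask)
        next_tactile = self.tactile_head(hidden)  # anticipated future contact signals
        return fused, next_tactile


if __name__ == "__main__":
    model = VisuoTactileFusion()
    vis = torch.randn(2, 196, 256)  # dummy visual patch tokens
    tac = torch.randn(2, 16, 256)   # dummy tactile token history
    fused, pred = model(vis, tac)
    print(fused.shape, pred.shape)  # (2, 16, 256) and (2, 16, 64)
```

In this reading, the fused representation would condition the imitation-learning policy for the multi-fingered hand, while the tactile prediction loss shapes the shared latent space; the curriculum described in the abstract would then control how task difficulty is scheduled during training.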

Country of Origin
🇺🇸 United States

Page Count
21 pages

Category
Computer Science:
Robotics