CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models
By: Catherine Glossop, William Chen, Arjun Bhorkar, and more
Potential Business Impact:
Teaches robots to follow tricky instructions better.
Generalist robots should be able to understand and follow user instructions, but current vision-language-action (VLA) models struggle to follow fine-grained commands despite providing a powerful architecture for mapping open-vocabulary natural language instructions to robot actions. One cause is a lack of semantic diversity and language grounding in existing robot datasets, and specifically a lack of fine-grained task diversity for similar observations. To address this, we present a novel method that augments existing robot datasets by leveraging vision-language models to create counterfactual labels. By generating counterfactual language and actions, our method increases the diversity and granularity of language grounding in robot datasets and thereby improves the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, through visual language navigation experiments in three different indoor and outdoor environments. Our experiments demonstrate that counterfactual relabeling, without any additional data collection, significantly improves instruction following in VLA policies, making them competitive with state-of-the-art methods and increasing success rate by 27% on navigation tasks.
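To make the relabeling idea concrete, the Python sketch below shows one way a dataset could be augmented with counterfactual instructions from a vision-language model. It is a minimal, hypothetical illustration, not the authors' implementation: the Episode structure, the query_vlm callable, and the prompt text are assumptions, and the sketch simply reuses the original actions rather than generating counterfactual actions as the paper describes.

# Illustrative sketch of counterfactual relabeling (not the authors' code).
# Assumes the caller supplies query_vlm, a function that sends an image and a
# text prompt to some vision-language model and returns its text response.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    """One labeled robot trajectory: an observation plus its original instruction."""
    image_path: str
    instruction: str
    actions: list


PROMPT_TEMPLATE = (
    "The robot in this image was told: '{instruction}'.\n"
    "List {k} alternative instructions a user could plausibly have given in this "
    "same scene, each referring to a different object or a finer-grained goal. "
    "Return one instruction per line."
)


def counterfactual_relabel(
    episode: Episode,
    query_vlm: Callable[[str, str], str],
    k: int = 3,
) -> List[Episode]:
    """Ask a VLM for counterfactual instructions grounded in the same observation,
    and return new training examples that reuse the episode's observation."""
    prompt = PROMPT_TEMPLATE.format(instruction=episode.instruction, k=k)
    response = query_vlm(episode.image_path, prompt)
    alternatives = [line.strip("-* ").strip() for line in response.splitlines() if line.strip()]
    # Placeholder: the paper also generates counterfactual actions; here the
    # original actions are copied so the sketch stays self-contained.
    return [
        Episode(episode.image_path, alt, episode.actions)
        for alt in alternatives[:k]
    ]


if __name__ == "__main__":
    # Dummy VLM stand-in so the sketch runs end to end without a real model.
    def fake_vlm(image_path: str, prompt: str) -> str:
        return "Go to the red chair\nStop next to the trash can\nTurn toward the doorway"

    demo = Episode("frame_000.png", "Go to the blue door", actions=[[0.1, 0.0]])
    for example in counterfactual_relabel(demo, fake_vlm):
        print(example.instruction)

In practice the relabeled episodes would be mixed back into the original dataset before fine-tuning the VLA policy, which is how the abstract's "no additional data collection" claim is realized.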
Similar Papers
Do What? Teaching Vision-Language-Action Models to Reject the Impossible
Artificial Intelligence
Robots understand when you're wrong and ask why.
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
Robotics
Robots understand and do tasks better.
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
Robotics
Robots learn to do more tasks with better instructions.