Harnessing Input-Adaptive Inference for Efficient VLN
By: Dongwoo Kang, Akhil Perincherry, Zachary Coalson, and more
Potential Business Impact:
Helps robots navigate using less computing power.
An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each of the agent's observations. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for early-exit methods. (3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views the agent has previously seen. In evaluations on seven VLN benchmarks, we demonstrate over a 2× reduction in computation across three off-the-shelf agents in both standard and continuous environments. Our code is publicly available at https://github.com/secure-ai-systems-group/adaptive-vision-and-language-navigation.
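The abstract only sketches the three mechanisms at a high level. As a rough illustration of the temporal-efficiency idea, the minimal Python sketch below caches encoded view features keyed by viewpoint and view index, so that views the agent has already processed are not re-encoded at later steps. The names here (ViewCache, encode_view) are illustrative assumptions and are not taken from the authors' released code.

```python
# Minimal sketch, assuming a PyTorch-style view encoder; names are hypothetical
# and do not come from the paper's repository.
from typing import Callable, Dict, Hashable, Tuple

import torch


class ViewCache:
    """Cache of encoded panoramic-view features, keyed by (viewpoint id, view index)."""

    def __init__(self, encode_view: Callable[[torch.Tensor], torch.Tensor]):
        self.encode_view = encode_view  # e.g., a ViT forward pass over one view image
        self._store: Dict[Tuple[Hashable, int], torch.Tensor] = {}

    def get(self, viewpoint_id: Hashable, view_idx: int,
            image: torch.Tensor) -> torch.Tensor:
        """Return cached features for a view, encoding it only on first sight."""
        key = (viewpoint_id, view_idx)
        if key not in self._store:
            with torch.no_grad():  # inference-time caching; no gradients needed
                self._store[key] = self.encode_view(image)
        return self._store[key]
```

Under this assumption, at each timestep the agent would request features through the cache for whichever views it chooses to process, paying the encoder cost only for views not seen at earlier viewpoints, which matches the temporal-efficiency behavior the abstract describes.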
Similar Papers
Efficient-VLN: A Training-Efficient Vision-Language Navigation Model
CV and Pattern Recognition
Teaches robots to navigate using less training.
FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks
CV and Pattern Recognition
Helps robots learn new places without retraining.
User-Feedback-Driven Continual Adaptation for Vision-and-Language Navigation
Artificial Intelligence
Teaches robots to learn from user corrections.