AstraNav-Memory: Context Compression for Long Memory
By: Botao Ren, Junjun Hu, Xinda Xue, and more
Lifelong embodied navigation requires agents to accumulate, retain, and exploit spatial-semantic experience across tasks, enabling efficient exploration in novel environments and rapid goal reaching in familiar ones. While object-centric memory is interpretable, it depends on detection and reconstruction pipelines that limit robustness and scalability. We propose an image-centric memory framework that achieves long-term implicit memory via an efficient visual context compression module coupled end-to-end with a Qwen2.5-VL-based navigation policy. Built atop a ViT backbone with frozen DINOv3 features and lightweight PixelUnshuffle+Conv blocks, our visual tokenizer supports configurable compression rates; under a representative 16$\times$ compression setting, for example, each image is encoded with about 30 tokens, expanding the effective context capacity from tens of images to hundreds. Experiments on GOAT-Bench and HM3D-OVON show that our method achieves state-of-the-art navigation performance, improving exploration in unfamiliar environments and shortening paths in familiar ones. Ablation studies further reveal that moderate compression strikes the best balance between efficiency and accuracy. These findings position compressed image-centric memory as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.
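To make the compression mechanism concrete, the sketch below shows one plausible PixelUnshuffle+Conv token compressor of the kind the abstract describes: patch tokens from a frozen encoder are reshaped into a 2D grid, an r×r spatial neighborhood is folded into the channel dimension, and a lightweight convolution projects the result to the language model's embedding width. The grid size, feature width, output width, and layer choices here are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a PixelUnshuffle+Conv visual-token compressor.
# All dimensions (dim, out_dim, grid) are hypothetical placeholders.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Compresses a ViT patch-token grid by a factor of r*r (r=4 gives 16x)."""
    def __init__(self, dim: int = 1024, out_dim: int = 3584, r: int = 4):
        super().__init__()
        self.r = r
        # PixelUnshuffle folds each r x r spatial neighborhood into channels:
        # (B, C, H, W) -> (B, C*r*r, H/r, W/r).
        self.unshuffle = nn.PixelUnshuffle(r)
        # A 1x1 conv mixes the folded neighborhood and projects to the
        # LLM embedding width (out_dim stands in for Qwen2.5-VL's width).
        self.proj = nn.Conv2d(dim * r * r, out_dim, kernel_size=1)

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (B, grid*grid, dim) patch features from a frozen encoder.
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, grid, grid)  # to a 2D map
        x = self.proj(self.unshuffle(x))  # (B, out_dim, grid/r, grid/r)
        return x.flatten(2).transpose(1, 2)  # (B, n / r^2, out_dim)

# Example: a 24x24 grid (576 tokens) compresses 16x to 36 tokens per image,
# in the same ballpark as the ~30 tokens quoted in the abstract.
feats = torch.randn(1, 576, 1024)
print(TokenCompressor()(feats, grid=24).shape)  # torch.Size([1, 36, 3584])
```

Because the r×r neighborhood is folded losslessly into channels before projection, the conv can in principle learn what to keep from each neighborhood, which is what lets a moderate compression rate trade little accuracy for a large gain in context capacity.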