From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
By: Nicolas Schuler, Lea Dewald, Nick Baldig and more
Potential Business Impact:
Lets robots understand what they see and do.
Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks that enable the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open-vocabulary tasks that combine multiple domains. However, their computational cost poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the tasks of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
Similar Papers
Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots
Robotics
Robots see and think without internet.
Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
CV and Pattern Recognition
Lets computers understand new pictures without training.
Evaluation of Vision-LLMs in Surveillance Video
CV and Pattern Recognition
Helps computers spot unusual things in videos.