Spatio-Temporal LLM: Reasoning about Environments and Actions
By: Haozhen Zheng, Beitong Tian, Mingyuan Wu, and more
Potential Business Impact:
Helps robots understand places and recent events.
Despite significant recent progress, Multimodal Large Language Models (MLLMs) still struggle to correctly answer prompts that require holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment in which an agent equipped with an MLLM can operate; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. Such holistic spatio-temporal understanding is, however, important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer these prompts. To improve, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors that improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at https://zoezheng126.github.io/STLLM-website/.
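The dual-projector idea from the abstract can be sketched minimally: one projector maps whole-environment (spatial) features and another maps recent-video (temporal) features into a shared LLM token space, and the projected tokens are concatenated into one sequence. All names, dimensions, and the linear-projector form below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_linear_projector(in_dim, out_dim, rng):
    """Return a toy linear map from a modality's feature space into the LLM token space."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.02
    b = np.zeros(out_dim)
    return lambda x: x @ W + b

d_llm = 64  # assumed LLM hidden size (illustrative)

# Hypothetical feature dimensions for the two modalities.
spatial_proj = make_linear_projector(32, d_llm, rng)   # environment / scene features
temporal_proj = make_linear_projector(48, d_llm, rng)  # recent video-clip features

env_feats = rng.standard_normal((100, 32))   # e.g., 100 scene tokens covering the environment
clip_feats = rng.standard_normal((16, 48))   # e.g., 16 frame tokens from the recent clip

# Concatenate both projected token streams into one sequence for the LLM.
tokens = np.concatenate([spatial_proj(env_feats), temporal_proj(clip_feats)], axis=0)
print(tokens.shape)  # (116, 64)
```

This only illustrates the interface (separate projectors, shared token space); the actual ST-LLM projectors and feature encoders are described in the paper.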