Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025
By: Qiaohui Chu, Haoyu Zhang, Yisen Feng, et al.
In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted with a high-performance visual encoder. These features are then fed into a Transformer that predicts verbs and nouns, with a verb-noun co-occurrence matrix incorporated to improve recognition accuracy. Finally, the recognized verb-noun pairs are formatted as textual prompts and passed to a fine-tuned large language model (LLM), which anticipates the future action sequence. Our framework achieved first place in the Ego4D LTA Challenge at CVPR 2025, establishing a new state of the art in long-term action anticipation. Our code will be released at https://github.com/CorrineQiu/Ego4D-LTA-Challenge-2025.
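To make the pipeline concrete, the following is a minimal PyTorch-style sketch of the two ideas the abstract highlights: re-ranking verb-noun predictions with a co-occurrence prior, and formatting the recognized pairs as a textual prompt for the LLM stage. The function names, the blending weight `alpha`, the prompt template, and the 20-step horizon are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch only: names, weights, and the prompt template are assumptions,
# not the official Ego4D-LTA-Challenge-2025 code.
import torch

def rerank_with_cooccurrence(verb_logits, noun_logits, co_occurrence, alpha=0.5):
    """Combine per-class verb/noun scores with a verb-noun co-occurrence prior.

    verb_logits:   (V,) scores from the Transformer recognition head
    noun_logits:   (N,) scores from the Transformer recognition head
    co_occurrence: (V, N) verb-noun pair counts gathered from the training split
    Returns the indices of the highest-scoring (verb, noun) pair.
    """
    verb_p = torch.softmax(verb_logits, dim=-1)               # (V,)
    noun_p = torch.softmax(noun_logits, dim=-1)               # (N,)
    pair_p = verb_p[:, None] * noun_p[None, :]                # (V, N) independent pair score
    prior = co_occurrence / co_occurrence.sum().clamp(min=1)  # normalized co-occurrence prior
    score = pair_p * ((1.0 - alpha) + alpha * prior)          # blend pair score with prior
    flat = score.argmax().item()
    return divmod(flat, score.shape[1])                       # (verb_idx, noun_idx)

def build_prompt(observed_actions, num_future=20):
    """Format recognized verb-noun pairs as a textual prompt for the fine-tuned LLM."""
    history = ", ".join(f"{verb} {noun}" for verb, noun in observed_actions)
    return (
        f"Observed actions: {history}. "
        f"Predict the next {num_future} actions as verb-noun pairs."
    )
```

As a usage example, `build_prompt([("take", "knife"), ("cut", "onion")])` would yield "Observed actions: take knife, cut onion. Predict the next 20 actions as verb-noun pairs.", which the fine-tuned LLM completes with the anticipated sequence.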