
RAPTOR: Real-Time High-Resolution UAV Video Prediction with Efficient Video Attention

Published: December 25, 2025 | arXiv ID: 2512.21710v1

By: Zhan Chen, Zile Guo, Enze Zhu and more

Potential Business Impact:

Drones see future to fly safer.

Business Areas:

Image Recognition, Data and Analytics, Software

Video prediction is plagued by a fundamental trilemma: achieving high resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR's single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to $O(S + T)$ and memory complexity to $O(\max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond, operating directly on dense feature maps with a patch-free design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state of the art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18%, paving the way for safer and more anticipatory embodied agents.
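To make the motivation for axis-wise factorization concrete, the sketch below counts query-key interactions for two generic schemes: full attention over all S*T flattened tokens, and a standard axial scheme that runs one spatial pass followed by one temporal pass. This is not the paper's EVA module, whose internals the abstract does not describe, and it does not reproduce the stated $O(S + T)$ bound. The function names and the example sizes are assumptions chosen for illustration only.

```python
def full_attention_pairs(s: int, t: int) -> int:
    """Query-key pairs when all S*T flattened tokens attend to each other."""
    n = s * t
    return n * n


def axial_attention_pairs(s: int, t: int) -> int:
    """Query-key pairs for one spatial pass plus one temporal pass.

    Spatial pass: each of the T frames attends within its own S positions.
    Temporal pass: each of the S positions attends across its own T frames.
    This is generic axial attention, not the paper's EVA module.
    """
    spatial = t * s * s
    temporal = s * t * t
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical sizes: a 64x64 feature map (S = 4096) over 8 frames.
    s, t = 64 * 64, 8
    full = full_attention_pairs(s, t)
    axial = axial_attention_pairs(s, t)
    print(f"full attention pairs:  {full:,}")
    print(f"axial attention pairs: {axial:,}")
    print(f"reduction factor:      {full / axial:.2f}x")
```

For this generic scheme the reduction factor works out to ST / (S + T), so the saving grows as both axes grow. Reaching the lower complexity the abstract reports would presumably require further mechanisms beyond plain softmax attention along each axis.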

A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data

CV and Pattern Recognition

Spots drones using many senses at once.

19 Nov 2025 0

87%

Predictive Uncertainty for Runtime Assurance of a Real-Time Computer Vision-Based Landing System

CV and Pattern Recognition

Helps planes land safely using cameras.

13 Aug 2025 1

87%

MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention

CV and Pattern Recognition

Helps drones understand each other's movements.

17 Oct 2025 0


Repos / Data Links

github.com

Page Count

8 pages
