A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning
By: Swadhin Das, Divyansh Mundra, Priyanshu Dayal, and more
Potential Business Impact:
Makes satellite pictures tell better stories.
Transformer-based models have achieved strong performance in remote sensing image captioning by capturing long-range dependencies and contextual information. However, their practical deployment is hindered by high computational costs, especially in multi-modal frameworks that employ separate transformer-based encoders and decoders. In addition, existing remote sensing image captioning models focus primarily on high-level semantic extraction while often overlooking fine-grained structural features such as edges, contours, and object boundaries. To address these challenges, a lightweight transformer architecture is proposed that reduces the dimensionality of the encoder layers and employs a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network. Furthermore, an edge-aware enhancement strategy is incorporated to improve image representation and object boundary understanding, enabling the model to capture fine-grained spatial details in remote sensing images. Experimental results demonstrate that the proposed approach significantly improves caption quality compared to state-of-the-art methods.
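The abstract does not give the exact distillation objective, but the standard formulation (Hinton et al., 2015) blends hard-label cross-entropy with a temperature-scaled KL divergence between teacher and student token distributions. The sketch below is an illustrative assumption of how such a loss could be wired up for caption logits; all names (`student_logits`, `teacher_logits`, `temperature`, `alpha`) are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a standard distillation loss; the paper's exact
# formulation is not specified in the abstract above.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend hard-label cross-entropy with soft-label KL divergence.

    student_logits / teacher_logits: (batch * seq_len, vocab_size)
    targets: (batch * seq_len,) ground-truth caption token ids.
    """
    # Soft targets: temperature-scaled KL divergence between the teacher's
    # and the student's token distributions, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures (Hinton et al., 2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: ordinary cross-entropy against the reference caption.
    hard = F.cross_entropy(student_logits, targets)

    return alpha * soft + (1.0 - alpha) * hard
```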
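Likewise, the abstract does not say how edge information is injected into the visual features. One common realization of an edge-aware enhancement is to extract a gradient-magnitude edge map with fixed Sobel kernels and mix it into the encoder features through a learned gate; the module below is a minimal sketch under that assumption, and every design choice in it (Sobel extraction, 1x1 projections, sigmoid gating) is illustrative rather than the paper's actual method.

```python
# Hypothetical edge-aware fusion module; the paper's actual design is not
# described in the abstract. Sobel kernels and the gated fusion are
# illustrative choices only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareFusion(nn.Module):
    """Fuse a Sobel edge map with visual feature maps via a learned gate."""

    def __init__(self, feat_channels: int):
        super().__init__()
        # Fixed Sobel kernels for horizontal/vertical image gradients.
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]]).view(1, 1, 3, 3)
        self.register_buffer("sobel_x", sobel_x)
        self.register_buffer("sobel_y", sobel_x.transpose(2, 3))
        # Project the 1-channel edge map to the feature width, then gate it.
        self.edge_proj = nn.Conv2d(1, feat_channels, kernel_size=1)
        self.gate = nn.Conv2d(2 * feat_channels, feat_channels, kernel_size=1)

    def forward(self, image: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); feats: (B, C, h, w) from the visual encoder.
        gray = image.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        gx = F.conv2d(gray, self.sobel_x, padding=1)
        gy = F.conv2d(gray, self.sobel_y, padding=1)
        edges = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)  # gradient magnitude
        # Match the spatial size of the encoder features before fusing.
        edges = F.interpolate(edges, size=feats.shape[-2:],
                              mode="bilinear", align_corners=False)
        edge_feats = self.edge_proj(edges)            # (B, C, h, w)
        # The gate decides, per location, how much edge signal to mix in.
        g = torch.sigmoid(self.gate(torch.cat([feats, edge_feats], dim=1)))
        return feats + g * edge_feats
```

A residual, gated fusion like this keeps the base features intact when edges are uninformative, which is one plausible way to sharpen object-boundary representation without destabilizing a lightweight encoder.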
Similar Papers
Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI
CV and Pattern Recognition
Makes robots understand pictures faster on small devices.
Image Recognition with Online Lightweight Vision Transformer: A Survey
CV and Pattern Recognition
Makes computer vision faster and use less power.
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation
CV and Pattern Recognition
Makes computers describe pictures in many languages.