VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System
By: Gwangyeon Ahn, Jiwan Seo, Joonhyuk Kang
Potential Business Impact:
Sends pictures and words together, saving space.
We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a text description and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
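
The abstract describes an encode, transmit, decode pipeline: a VLM encodes the image into a single VLF, the VLF passes through a noisy channel, and two generators at the receiver are conditioned on the received feature. The sketch below illustrates that flow in PyTorch under stated assumptions; the encoder, decoders, feature dimension, vocabulary size, and AWGN channel are illustrative placeholders standing in for the pre-trained VLM, the decoder-based language model, and the diffusion-based image generator, not the authors' implementation.

import torch
import torch.nn as nn

class VLFMSC(nn.Module):
    """Minimal sketch of the VLF-MSC pipeline (placeholder modules)."""

    def __init__(self, feat_dim: int = 512, vocab_size: int = 1000, max_len: int = 16):
        super().__init__()
        # Transmitter: placeholder vision-language encoder mapping an image
        # to one compact vision-language feature (VLF).
        self.vlm_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4),
            nn.Flatten(),
            nn.LazyLinear(feat_dim),
        )
        # Receiver: a text head and an image generator, both conditioned on
        # the (noisy) received VLF.
        self.text_decoder = nn.Linear(feat_dim, max_len * vocab_size)
        self.image_generator = nn.Sequential(
            nn.Linear(feat_dim, 3 * 64 * 64),
            nn.Sigmoid(),
        )
        self.max_len, self.vocab_size = max_len, vocab_size

    @staticmethod
    def awgn_channel(x: torch.Tensor, snr_db: float) -> torch.Tensor:
        # Additive white Gaussian noise channel at a given SNR (dB).
        signal_power = x.pow(2).mean()
        noise_power = signal_power / (10 ** (snr_db / 10))
        return x + torch.randn_like(x) * noise_power.sqrt()

    def forward(self, image: torch.Tensor, snr_db: float = 0.0):
        vlf = self.vlm_encoder(image)              # image -> VLF
        received = self.awgn_channel(vlf, snr_db)  # transmit over noisy channel
        text_logits = self.text_decoder(received).view(-1, self.max_len, self.vocab_size)
        recon_image = self.image_generator(received).view(-1, 3, 64, 64)
        return text_logits, recon_image

if __name__ == "__main__":
    model = VLFMSC()
    img = torch.rand(1, 3, 64, 64)
    logits, recon = model(img, snr_db=5.0)
    print(logits.shape, recon.shape)  # torch.Size([1, 16, 1000]) torch.Size([1, 3, 64, 64])

The key property the sketch captures is that a single transmitted feature drives both outputs: text and image decoders share the same received VLF, so no modality-specific stream or retransmission is needed, and lowering snr_db lets one probe robustness to channel noise.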
Similar Papers
Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model
CV and Pattern Recognition
Sends pictures better by describing them with words.
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
CV and Pattern Recognition
Lets computers understand many satellite picture jobs.
STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
CV and Pattern Recognition
Helps self-driving cars understand traffic better.