Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks
By: Junhe Zhang, Wanli Ni, Pengwei Wang and more
Potential Business Impact:
Makes AI understand pictures and words faster.
Despite the promising paradigm enabled by integrating semantic communication (SemCom) with multimodal large models (MLMs) for transmitting and utilizing multimodal data, efficiently fusing and exploiting cross-modal information remains challenging. Moreover, widely adopted transformer-based architectures inevitably produce excessively long token embeddings for transmission. These long embeddings result in higher bandwidth consumption, increased power usage, and greater latency, which makes them impractical in resource-constrained networks. In this letter, we propose a task-oriented multimodal token transmission scheme for efficient multimodal information fusion and utilization. To improve inter-modal consistency and task-relevant token transmission, we design a two-stage training algorithm consisting of cross-modal alignment followed by task-oriented fine-tuning. Meanwhile, token compression is performed using a sliding-window pooling operation to conserve limited communication resources. To balance the trade-off between latency reduction and the performance degradation caused by compression, we formulate a weighted-sum optimization problem over latency and inference performance. We jointly optimize bandwidth, power allocation, and token length across users using an alternating optimization method. Simulation results demonstrate that the proposed algorithm outperforms the baseline under different bandwidth and power budgets. Moreover, the two-stage training algorithm achieves higher accuracy across various signal-to-noise ratios than the method without cross-modal alignment.
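The abstract does not give the pooling window size or stride, and it does not say whether pooling averages or takes a maximum. The sketch below is one plausible reading. It applies average pooling along the sequence axis of a token sequence before transmission. The `window` and `stride` parameters and the helper names `sliding_window_pool` and `pooled_length` are hypothetical. It only illustrates how pooling shortens the sequence that must be sent.

```python
from typing import List

Token = List[float]  # one token embedding vector


def pooled_length(num_tokens: int, window: int, stride: int) -> int:
    """Number of complete windows that fit in a sequence of num_tokens tokens."""
    if num_tokens < window:
        return 0
    return (num_tokens - window) // stride + 1


def sliding_window_pool(tokens: List[Token], window: int, stride: int) -> List[Token]:
    """Average-pool token embeddings along the sequence axis (hypothetical helper).

    Each output token is the element-wise mean of `window` consecutive input
    tokens; the window then advances by `stride`. Trailing tokens that do not
    fill a complete window are dropped in this sketch.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive integers")
    pooled: List[Token] = []
    for start in range(0, len(tokens) - window + 1, stride):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(tok[d] for tok in chunk) / window for d in range(dim)])
    return pooled


if __name__ == "__main__":
    # Six 2-dimensional token embeddings (toy values).
    seq = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0], [8.0, 8.0], [10.0, 10.0]]
    print(sliding_window_pool(seq, window=2, stride=2))  # [[1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
    print(pooled_length(len(seq), window=3, stride=1))   # 4
```

Setting `stride == window` gives non-overlapping windows and cuts the token count by roughly a factor of `window`. A smaller stride keeps more tokens but produces overlapping, smoother summaries.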
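The abstract names the trade-off (latency versus inference performance) and the solver family (alternating optimization over bandwidth, power, and token length). It does not give the system model. The toy sketch below is a minimal illustration of block-coordinate (alternating) optimization under loudly stated assumptions. It is not the authors' algorithm. The assumptions are:

- Each user's link runs at the Shannon rate.
- The latency term is the sum of per-user latencies.
- `perf_loss` is a made-up quadratic proxy for accuracy loss.
- Candidate token lengths come from non-overlapping pooling windows of 1 to 16.
- The bandwidth and power budgets are split into discrete units and allocated greedily.

All constants and function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy constants (not taken from the paper).
BITS_PER_TOKEN = 16 * 256                 # a 256-dim embedding quantized to 16 bits
NOISE_PSD = 10 ** (-174 / 10) * 1e-3      # thermal noise, -174 dBm/Hz expressed in W/Hz


def latency(tokens: int, bw: float, power: float, gain: float) -> float:
    """Seconds to send `tokens` tokens at the Shannon rate of one user's link (assumed model)."""
    rate = bw * math.log2(1 + power * gain / (NOISE_PSD * bw))
    return tokens * BITS_PER_TOKEN / rate


def perf_loss(tokens: int, full_len: int) -> float:
    """Hypothetical accuracy-loss proxy that grows as more tokens are pooled away."""
    return (1 - tokens / full_len) ** 2


def greedy_split(num_users: int, units: int, cost: Callable[[int, int], float]) -> List[int]:
    """Split `units` budget units among users to minimize sum_k cost(k, n_k).

    Every user starts with one unit; each remaining unit goes to the user whose
    cost drops the most. This is optimal when each cost is convex in n_k.
    """
    alloc = [1] * num_users
    for _ in range(units - num_users):
        drops = [cost(k, alloc[k]) - cost(k, alloc[k] + 1) for k in range(num_users)]
        alloc[max(range(num_users), key=drops.__getitem__)] += 1
    return alloc


def alternating_opt(gains: Sequence[float], full_len: int, bw_total: float,
                    p_total: float, lam: float, units: int = 100,
                    max_rounds: int = 20) -> Tuple[List[int], List[float], List[float], float]:
    """Toy weighted-sum minimization: sum_k latency_k + lam * sum_k perf_loss_k."""
    K = len(gains)
    if units < K:
        raise ValueError("need at least one budget unit per user")
    candidates = sorted({full_len // w for w in (1, 2, 4, 8, 16)})  # hypothetical pooling windows
    bw_unit, p_unit = bw_total / units, p_total / units
    bw, pw, toks = [units // K] * K, [units // K] * K, [full_len] * K

    def user_cost(k: int, L: int, nb: int, npw: int) -> float:
        return latency(L, nb * bw_unit, npw * p_unit, gains[k]) + lam * perf_loss(L, full_len)

    def objective() -> float:
        return sum(user_cost(k, toks[k], bw[k], pw[k]) for k in range(K))

    prev = objective()
    for _ in range(max_rounds):
        # Block 1: token lengths; separable per user once bandwidth and power are fixed.
        toks = [min(candidates, key=lambda L: user_cost(k, L, bw[k], pw[k])) for k in range(K)]
        # Block 2: bandwidth units with token lengths and power fixed.
        bw = greedy_split(K, units, lambda k, n: user_cost(k, toks[k], n, pw[k]))
        # Block 3: power units with token lengths and bandwidth fixed.
        pw = greedy_split(K, units, lambda k, n: user_cost(k, toks[k], bw[k], n))
        cur = objective()
        converged = prev - cur <= 1e-12 * prev
        prev = cur
        if converged:
            break
    return toks, [n * bw_unit for n in bw], [n * p_unit for n in pw], prev


if __name__ == "__main__":
    toks, bw, pw, J = alternating_opt([1e-10, 2e-11], full_len=576,
                                      bw_total=10e6, p_total=1.0, lam=0.05)
    print(toks, bw, pw, J)
```

Each block is solved exactly on its discretized grid. Token lengths are chosen by enumeration. Bandwidth and power are allocated greedily, which works because Shannon-rate latency is convex in either resource. As a result, the weighted-sum objective never increases from one round to the next. The subproblem solvers used in the letter may differ. Raising `lam` penalizes compression more heavily and pushes the chosen token lengths back toward `full_len`.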