Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models
By: Ruiqi Shen, Haotian Wu, Wenjing Zhang, and more
Potential Business Impact:
Makes pictures smaller while keeping their meaning.
Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately 2-3 × 10^-3 bits per pixel. This is less than 5% of the bit rate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.
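To make the idea concrete, the sketch below shows one way to compress a CLIP image embedding instead of pixels and to compute the resulting bits per pixel. This is not the authors' codec: the model checkpoint ("openai/clip-vit-base-patch32"), the uniform scalar quantization, and the 2-bits-per-dimension setting are all illustrative assumptions chosen only to show why a 512-dimensional embedding lands in the ~10^-3 bits-per-pixel range for a 512×512 image.

```python
# Illustrative sketch (not the authors' method): quantize a CLIP image embedding
# to a few bits per dimension and report the effective bits per pixel.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; any CLIP variant would work similarly.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def compress_embedding(image: Image.Image, bits_per_dim: int = 2):
    """Uniform scalar quantization of the CLIP image embedding (hypothetical scheme)."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        z = model.get_image_features(**inputs)           # shape: (1, 512)
    z = z / z.norm(dim=-1, keepdim=True)                 # unit-normalize the embedding
    levels = 2 ** bits_per_dim
    z_min, z_max = z.min(), z.max()
    scale = (z_max - z_min).clamp(min=1e-8) / (levels - 1)
    codes = torch.round((z - z_min) / scale).to(torch.uint8)  # quantized codes to transmit
    z_hat = codes.float() * scale + z_min                      # dequantized embedding
    total_bits = codes.numel() * bits_per_dim
    width, height = image.size
    bpp = total_bits / (width * height)
    return codes, z_hat, bpp

# Example: 512 dimensions at 2 bits each over a 512x512 image -> ~0.004 bpp,
# the same order of magnitude as the rates reported in the abstract.
image = Image.new("RGB", (512, 512))
codes, z_hat, bpp = compress_embedding(image, bits_per_dim=2)
print(f"bits per pixel: {bpp:.4f}")
```

The dequantized embedding z_hat, rather than a reconstructed image, would then feed downstream tasks such as zero-shot classification or retrieval; the actual paper may use a different quantization or entropy-coding scheme.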
Similar Papers
UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection
CV and Pattern Recognition
Finds fake videos even when they are squeezed.
Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding
CV and Pattern Recognition
Makes computers understand pictures and words together better.
Knowledge-Base based Semantic Image Transmission Using CLIP
CV and Pattern Recognition
Sends pictures by describing their meaning, not pixels.