DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion

Published: January 9, 2026 | arXiv ID: 2601.05538v1

By: Yiming Sun, Zifan Ye, Qinghua Hu and more

Potential Business Impact:

Combines night vision and regular camera images.

Business Areas:

Image Recognition Data and Analytics, Software

Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend to either over-prioritize infrared intensity at the cost of visible details or, conversely, preserve visible structure while diminishing thermal target salience. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on driving-scenario and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.
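The two ideas the abstract leans on, a discrepancy map between modalities and a linear-time scan over tokens from both modalities, can be illustrated with a toy sketch. This is not the DIFF-MF architecture: the function names, the (1 + d) guidance rule, the token interleaving order, and the scalar recurrence with fixed `decay` and `gain` are all assumptions chosen for illustration, and real state space models learn their parameters and operate on multi-channel features.

```python
def difference_map(ir, vis):
    """Per-pixel absolute difference of two equally sized 2D maps, scaled to [0, 1]."""
    diff = [[abs(a - b) for a, b in zip(r_ir, r_vis)] for r_ir, r_vis in zip(ir, vis)]
    peak = max(max(row) for row in diff)
    if peak == 0:
        return diff  # identical inputs: the map is all zeros
    return [[d / peak for d in row] for row in diff]


def guide(features, dmap):
    """Hypothetical guidance rule: boost features where the modalities disagree."""
    return [[x * (1.0 + d) for x, d in zip(r_x, r_d)] for r_x, r_d in zip(features, dmap)]


def interleave(ir, vis):
    """Flatten both maps row-major and alternate tokens: ir, vis, ir, vis, ..."""
    seq = []
    for r_ir, r_vis in zip(ir, vis):
        for a, b in zip(r_ir, r_vis):
            seq.extend((a, b))
    return seq


def linear_scan(seq, decay=0.5, gain=0.5):
    """Scalar recurrence h_t = decay * h_{t-1} + gain * x_t, one pass over seq."""
    h = 0.0
    out = []
    for x in seq:
        h = decay * h + gain * x  # each state summarizes all earlier tokens
        out.append(h)
    return out


# Toy 2x2 "infrared" and "visible" feature maps (made-up values).
ir = [[0.0, 1.0], [0.0, 0.0]]
vis = [[0.0, 0.0], [0.5, 0.0]]

dmap = difference_map(ir, vis)
fused_states = linear_scan(interleave(guide(ir, dmap), guide(vis, dmap)))
```

The scan touches each token once, so its cost grows linearly with the number of pixels, which is the property the abstract contrasts with heavier global-dependency mechanisms.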

IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection

CV and Pattern Recognition

Helps cameras see better in fog and darkness.

11 Sep 2025 2

89%

Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

CV and Pattern Recognition

Combines pictures using words to make better images.

8 Dec 2025 1

88%

FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution

CV and Pattern Recognition

Makes blurry pictures sharper and clearer.

11 Sep 2025 2

Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

13 pages
