DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

Published: September 19, 2025 | arXiv ID: 2509.16017v1

By: Meng Yang, Fan Fan, Zizhuo Li and more

Potential Business Impact:

Helps computers match pictures from different cameras.

Business Areas:

Image Recognition Data and Analytics, Software

Multimodal image matching seeks pixel-level correspondences between images of different modalities and is crucial for cross-modal perception, fusion, and analysis. However, the significant appearance differences between modalities make this task challenging. Because high-quality annotated datasets are scarce, existing deep learning methods that extract modality-common features for matching perform poorly and adapt badly to diverse scenarios. Vision Foundation Models (VFMs), trained on large-scale data, yield generalizable and robust feature representations that adapt to data and tasks of various modalities, including multimodal matching. We therefore propose DistillMatch, a multimodal image matching method that uses knowledge distillation from VFMs. DistillMatch builds a lightweight student model that extracts high-level semantic features from VFMs (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts modality category information and injects it into the other modality's features, which enhances the model's understanding of cross-modal correlations. Furthermore, we design V2I-GAN, which translates visible images into pseudo-infrared images for data augmentation, to boost the model's generalization. Experiments show that DistillMatch outperforms existing algorithms on public datasets.
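
The abstract does not say which distillation objective DistillMatch uses. As a rough illustration of the general idea only, the sketch below computes a common feature-level distillation loss: the mean cosine distance between a student's per-patch features and the matching features from a frozen teacher. The function names, the list-of-vectors feature layout, and the choice of cosine distance are all assumptions made for this example, not details taken from the paper.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def feature_distillation_loss(
    student: Sequence[Vector], teacher: Sequence[Vector]
) -> float:
    """Mean (1 - cosine similarity) over corresponding patch features.

    Hypothetical helper: `student` and `teacher` each hold one feature
    vector per image patch, in the same patch order. The loss is 0 when
    every student feature points in the same direction as its teacher
    feature and grows toward 2 as the directions become opposite.
    """
    if not student or len(student) != len(teacher):
        raise ValueError("need the same non-zero number of patch features")
    total = sum(1.0 - cosine_similarity(s, t) for s, t in zip(student, teacher))
    return total / len(student)
```

In a real training loop the teacher features would come from a frozen foundation model and the student would be updated by gradient descent on this loss, usually alongside a task loss for matching. Cosine distance ignores feature magnitude, so it is often used when student and teacher features differ in scale.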

Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation

CV and Pattern Recognition

Teaches computers to see diseases in X-rays.

10 Mar 2025 0

89%

CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

CV and Pattern Recognition

Teaches self-driving cars to see better.

12 Mar 2025 1

88%

EmoVLM-KD: Fusing Distilled Expertise with Vision-Language Models for Visual Emotion Analysis

Multimedia

Helps computers understand emotions in pictures better.

12 May 2025 1

Page Count

10 pages
