
CATCH: A Modular Cross-domain Adaptive Template with Hook

Published: October 30, 2025 | arXiv ID: 2510.26582v1

By: Xinjin Li, Yulie Lu, Jinghan Cao and more

Potential Business Impact:

Makes AI understand different kinds of pictures better.

Business Areas:

Image Recognition, Data and Analytics, Software

Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when they are transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, owing to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and hard to scale across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier that identifies the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, so the backbone model needs no retraining. Experiments across four domain-specific VQA benchmarks show consistent performance gains, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results indicate that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
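The hook-based injection described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the class and function names (`FrozenBackbone`, `DomainAdapters`, `classify_domain`, `adapted_answer`), the toy threshold classifier, and the arithmetic inside each adapter are all assumptions made for illustration. The point it shows is only the control flow: a classifier picks a domain, a prompt adapter and a visual adapter are attached through a single hook interface, and the backbone itself is never modified.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = List[float]  # toy stand-in for image features


@dataclass
class DomainAdapters:
    """Hypothetical pair of lightweight adapters for one domain."""
    prompt_adapter: Callable[[str], str]
    visual_adapter: Callable[[Features], Features]


class FrozenBackbone:
    """Toy stand-in for a VQA model; its own logic is never changed."""

    def __init__(self) -> None:
        self._prompt_hooks: List[Callable[[str], str]] = []
        self._visual_hooks: List[Callable[[Features], Features]] = []

    def register_hooks(self, adapters: DomainAdapters) -> Callable[[], None]:
        """Attach both adapters; return a function that detaches them."""
        self._prompt_hooks.append(adapters.prompt_adapter)
        self._visual_hooks.append(adapters.visual_adapter)

        def remove() -> None:
            self._prompt_hooks.remove(adapters.prompt_adapter)
            self._visual_hooks.remove(adapters.visual_adapter)

        return remove

    def answer(self, image: Features, question: str) -> str:
        feats = list(image)  # toy "vision encoder": identity
        for hook in self._visual_hooks:
            feats = hook(feats)
        prompt = question
        for hook in self._prompt_hooks:
            prompt = hook(prompt)
        # Toy "decoder": echo the prompt and a feature summary.
        return f"{prompt} | feature_sum={sum(feats):.2f}"


def classify_domain(image: Features) -> str:
    """Toy rule (assumption); a real classifier would be a trained model."""
    return "medical" if sum(image) / len(image) > 0.5 else "chart"


# Hypothetical adapter registry, one entry per supported domain.
ADAPTERS: Dict[str, DomainAdapters] = {
    "medical": DomainAdapters(
        prompt_adapter=lambda q: "[medical] " + q,
        visual_adapter=lambda f: [x * 0.5 for x in f],
    ),
    "chart": DomainAdapters(
        prompt_adapter=lambda q: "[chart] " + q,
        visual_adapter=lambda f: [x + 1.0 for x in f],
    ),
}


def adapted_answer(model: FrozenBackbone, image: Features, question: str) -> str:
    """Classify the input, inject the matching adapters, then detach them."""
    remove = model.register_hooks(ADAPTERS[classify_domain(image)])
    try:
        return model.answer(image, question)
    finally:
        remove()


model = FrozenBackbone()
print(adapted_answer(model, [0.8, 1.0], "What is shown?"))
print(adapted_answer(model, [0.1, 0.2], "What is the trend?"))
```

Detaching the hooks after each call leaves the backbone in its original state, which is one simple way to serve several domains with a single shared model.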

Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation

CV and Pattern Recognition

Makes computers judge video quality better, faster.

8 Aug 2025 0

87%

CATCH: A Controllable Theme Detection Framework with Contextualized Clustering and Hierarchical Generation

Computation and Language

Helps chatbots understand what you're talking about.

25 Dec 2025 1

86%

Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

CV and Pattern Recognition

Teaches computers to learn from few pictures.

4 Sep 2025 1


Country of Origin

🇺🇸 🇨🇳 United States, China

Page Count

18 pages
