Score: 2

Explore How to Inject Beneficial Noise in MLLMs

Published: November 17, 2025 | arXiv ID: 2511.12917v1

By: Ruishu Zhu , Sida Huang , Ziheng Jiao and more

Potential Business Impact:

Makes AI better at understanding pictures and words together.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Multimodal Large Language Models (MLLMs) play an increasingly important role in multimodal intelligence. However, existing fine-tuning methods often ignore cross-modal heterogeneity, which limits their potential. In this work, we propose a novel fine-tuning strategy that injects beneficial random noise; it outperforms previous methods and even surpasses full fine-tuning while adding minimal parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLM. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this noise into the MLLM effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, QwenVL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches while requiring only about $1\sim2\%$ additional parameters. The relevant code is provided in the supplementary material.
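The general idea of generating noise from an image-text pair and adding it to a frozen model's hidden state can be illustrated with a toy sketch. This is not the authors' MuNG implementation: the class `ToyNoiseGenerator`, the function `inject_noise`, the concatenation of features, and the linear maps for mean and log-variance are all hypothetical choices made for illustration, assuming only that the noise is sampled with the standard reparameterization trick familiar from variational inference.

```python
import math
import random


class ToyNoiseGenerator:
    """Hypothetical stand-in for a learned noise generator (not the paper's MuNG)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        # In a real setup these would be the only trainable weights;
        # here they are fixed small random constants.
        self.w_mu = [[self.rng.gauss(0.0, 0.1) for _ in range(2 * dim)]
                     for _ in range(dim)]
        self.w_logvar = [[self.rng.gauss(0.0, 0.1) for _ in range(2 * dim)]
                         for _ in range(dim)]

    @staticmethod
    def _matvec(weights, vec):
        # Plain matrix-vector product over nested lists.
        return [sum(w * v for w, v in zip(row, vec)) for row in weights]

    def sample(self, image_feat, text_feat):
        # Condition on both modalities by concatenating their features
        # (an assumption; the source does not specify the conditioning).
        joint = list(image_feat) + list(text_feat)
        mu = self._matvec(self.w_mu, joint)
        logvar = self._matvec(self.w_logvar, joint)
        # Reparameterization: noise = mu + sigma * eps, with eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]


def inject_noise(frozen_hidden, noise):
    """Add generated noise to a hidden vector produced by the frozen model."""
    return [h + n for h, n in zip(frozen_hidden, noise)]


# Usage with made-up feature vectors of dimension 4.
gen = ToyNoiseGenerator(dim=4, seed=0)
image_feat = [0.2, -0.1, 0.5, 0.0]
text_feat = [0.1, 0.3, -0.2, 0.4]
hidden = [1.0, 1.0, 1.0, 1.0]
noisy_hidden = inject_noise(hidden, gen.sample(image_feat, text_feat))
```

In this sketch the backbone's hidden state is never modified by training; only the generator's weights would be learned, which is one plausible reading of how a small number of added parameters could steer a frozen model.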

FANoise: Singular Value-Adaptive Noise Modulation for Robust Multimodal Representation Learning

Machine Learning (CS)

Makes AI understand pictures and words better.

26 Nov 2025 3

89%

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

CV and Pattern Recognition

Teaches AI to trust the right information.

28 Nov 2025 1

89%

From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Machine Learning (CS)

Helps computers understand pictures and words together.

1 Aug 2025 1


Repos / Data Links

github.com

Page Count

11 pages
