Modality Inflation: Energy Characterization and Optimization Opportunities for MLLM Inference
By: Mona Moghadampanah, Adib Rezaei Shahmirzadi, Farhana Amin, and more
Potential Business Impact:
Makes AI models use less power when processing images.
Multimodal large language models (MLLMs) build on text-only LLMs by incorporating additional modalities, enabling multimodal understanding and a broader range of applications. However, these additions introduce an energy trade-off across modalities that remains poorly understood, as most prior work focuses on text-only models. In this paper, we examine modality inflation, a key source of inefficiency in which multimodal inputs increase inference workloads through extra encoding stages and expanded token sequences. We provide the first detailed, stage-level analysis of energy consumption in MLLM inference by breaking the pipeline into vision encoding, prefill, and decoding stages. Using four representative MLLMs evaluated on an NVIDIA A100 GPU, we quantify the additional energy required for multimodal inference compared to text-only baselines, observing overheads ranging from 17% to 94% across models for identical inputs. Our results show that energy bottlenecks differ widely across model architectures, stemming either from compute-heavy vision encoders or from the downstream impact of large visual token sequences during prefill. By examining GPU power traces, we further uncover substantial GPU underutilization during multimodal execution and show that input complexity leads to markedly different energy scaling behaviors across models. Finally, we demonstrate that stage-wise dynamic voltage and frequency scaling (DVFS) is an effective optimization, allowing energy savings with only modest performance impact. Together, these findings offer practical insights and concrete guidance for designing more energy-efficient multimodal LLM serving systems.
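To make the stage-level methodology concrete, below is a minimal sketch (not the authors' code) of how per-stage GPU energy could be measured, and how a stage-wise clock cap could stand in for DVFS, using NVML through the pynvml bindings. It assumes a Volta-or-newer GPU (such as the A100 used in the paper) for the cumulative energy counter, root privileges for clock locking, and hypothetical stage callables such as run_vision_encoder; each stage function is expected to block until its GPU work completes (e.g., via torch.cuda.synchronize()) before returning.

```python
# Sketch: per-stage energy accounting and stage-wise clock capping with NVML.
# Assumptions (not from the paper): pynvml installed, GPU index 0, a Volta+
# GPU for nvmlDeviceGetTotalEnergyConsumption, and root privileges for
# nvmlDeviceSetGpuLockedClocks.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def measure_stage_energy(stage_fn, *args, **kwargs):
    """Run one pipeline stage and return (result, joules consumed).

    Uses NVML's cumulative energy counter (reported in millijoules).
    stage_fn must block until its GPU work finishes before returning.
    """
    start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    result = stage_fn(*args, **kwargs)
    end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    return result, (end_mj - start_mj) / 1000.0

def run_stage_at_clock(stage_fn, sm_clock_mhz, *args, **kwargs):
    """Lock the SM clock for one stage (a crude stage-wise DVFS knob),
    then restore the default clock policy afterwards."""
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, sm_clock_mhz, sm_clock_mhz)
    try:
        return measure_stage_energy(stage_fn, *args, **kwargs)
    finally:
        pynvml.nvmlDeviceResetGpuLockedClocks(handle)

# Hypothetical usage, with run_vision_encoder / run_prefill / run_decode
# standing in for the three stages the paper profiles; the clock values
# are example A100 frequency steps, not settings from the paper:
#   feats,  e_vis     = run_stage_at_clock(run_vision_encoder, 1095, image)
#   kv,     e_prefill = run_stage_at_clock(run_prefill, 1410, feats, prompt)
#   tokens, e_decode  = run_stage_at_clock(run_decode, 1095, kv)
```

Reading the cumulative energy counter avoids hand-integrating a sampled power trace, and the finally block returns the clocks to driver control even if a stage raises an exception.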
Similar Papers
Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings
Machine Learning (CS)
Makes AI use less power without losing smarts.
Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
Distributed, Parallel, and Cluster Computing
Makes AI understand pictures and videos faster.
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
Computation and Language
Cuts AI's energy use by 73%.