Multi-Agent Semantic- and Emotion-Aligned Music-to-Image Generation with Music-Derived Captions
By: Junchang Shi, Gang Li
When people listen to music, they often experience rich visual imagery. We aim to externalize this inner imagery by generating images conditioned on music. We propose MESA-MIG, a multi-agent, semantic- and emotion-aligned framework that first produces structured music captions and then refines them with cooperating agents specializing in scene, motion, style, color, and composition. In parallel, a Valence-Arousal (VA) regression head predicts continuous affective states from music, while a CLIP-based visual VA head estimates emotions from images. Together, these components enforce semantic and emotional alignment between the music and the synthesized images. Experiments on curated music-image pairs show that MESA-MIG outperforms caption-only and single-agent baselines in aesthetic quality, semantic consistency, and VA alignment, and achieves competitive emotion-regression performance compared with state-of-the-art music and image emotion models.
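To make the alignment idea concrete, the sketch below illustrates one plausible form of the joint objective described in the abstract: a music-side VA regression head, an image-side VA head standing in for the CLIP-based visual estimator, and a loss that couples semantic agreement (embedding cosine) with emotional agreement (valence-arousal distance). All names, layer sizes, and weights here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a semantic + VA alignment objective (hypothetical names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAHead(nn.Module):
    """Maps a modality embedding to continuous (valence, arousal) in [-1, 1]."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 2)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.proj(emb))

def alignment_loss(music_emb, image_emb, music_va_head, image_va_head,
                   w_sem: float = 1.0, w_emo: float = 1.0) -> torch.Tensor:
    """Semantic term: cosine distance between music-caption and image embeddings.
    Emotional term: MSE between the two predicted valence-arousal states."""
    sem = 1.0 - F.cosine_similarity(music_emb, image_emb, dim=-1).mean()
    emo = F.mse_loss(music_va_head(music_emb), image_va_head(image_emb))
    return w_sem * sem + w_emo * emo

if __name__ == "__main__":
    dim = 512                          # assumed shared embedding width (e.g. a CLIP-like space)
    music_emb = torch.randn(4, dim)    # placeholder music-caption embeddings
    image_emb = torch.randn(4, dim)    # placeholder embeddings of generated images
    loss = alignment_loss(music_emb, image_emb, VAHead(dim), VAHead(dim))
    print(f"joint semantic + VA alignment loss: {loss.item():.4f}")
```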