ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models
By: Sibo Dong, Ismail Shaheen, Maggie Shen, and more
Potential Business Impact:
Makes stories with pictures that make sense.
Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, ViSTA. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ a Visual Question Answering-based metric, TIFA, to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model generates image sequences that are not only consistent across frames but also well aligned with the narrative text descriptions.
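The salient history selection described above can be illustrated with a minimal sketch. This is not the paper's implementation: the embeddings are toy vectors, and scoring history pairs by cosine similarity between the current prompt embedding and each history text embedding is an assumption made here for illustration; the paper only states that the single most salient history text-image pair is selected at inference.

```python
import numpy as np

def select_salient_history(current_text_emb, history_text_embs):
    """Return the index of the most salient history pair.

    Hypothetical stand-in for ViSTA's salient history selection:
    scores each history text embedding by cosine similarity to the
    current prompt embedding and picks the best one (an assumption,
    not the paper's stated criterion).
    """
    cur = current_text_emb / np.linalg.norm(current_text_emb)
    hist = history_text_embs / np.linalg.norm(
        history_text_embs, axis=1, keepdims=True
    )
    scores = hist @ cur          # cosine similarity per history frame
    return int(np.argmax(scores)), scores

# Toy embeddings: history frame 0 shares content with the current prompt.
cur = np.array([1.0, 0.0, 0.2])
hist = np.array([
    [0.9, 0.1, 0.1],   # frame 0: close to the current prompt
    [0.0, 1.0, 0.0],   # frame 1: unrelated
])
idx, scores = select_salient_history(cur, hist)  # idx -> 0
```

In the full model, the selected text-image pair would then be passed through the multi-modal history fusion module, and the fused features would condition the diffusion model via the history adapter.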
Similar Papers
VisTA: Vision-Text Alignment Model with Contrastive Learning using Multimodal Data for Evidence-Driven, Reliable, and Explainable Alzheimer's Disease Diagnosis
CV and Pattern Recognition
Helps doctors find Alzheimer's with pictures.
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?
Computation and Language
Computers write stories from pictures.
Vistoria: A Multimodal System to Support Fictional Story Writing through Instrumental Text-Image Co-Editing
Human-Computer Interaction
Helps writers create stories with pictures and words.