FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

Published: March 25, 2025 | arXiv ID: 2503.19839v2

By: Jun Zhou, Jiahao Li, Zunnan Xu and more

Potential Business Impact:

Edits pictures precisely from your words.

Business Areas:

Photo Editing, Content and Publishing, Media and Entertainment

Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of vision language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Specifically, we enhance the fine-grained visual perception capabilities of the VLM by introducing additional region tokens. Relying solely on the output of the LLM to guide the diffusion model may lead to suboptimal editing results. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens and the time-dependent diffusion model, FireEdit demonstrates significant advantages in comprehending editing instructions and maintaining high semantic consistency. Extensive experiments indicate that our approach surpasses the state-of-the-art instruction-based image editing methods. Our project is available at https://zjgans.github.io/fireedit.github.io.
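The abstract says the Time-Aware Target Injection module integrates timestep embeddings with the text embeddings so that guidance strength varies across denoising stages. The sketch below illustrates one common way such conditioning can be done and is not the authors' implementation: it assumes a standard sinusoidal timestep embedding and a scale-and-shift modulation of each text token. The function names and the weight matrices `w_scale` and `w_shift` are hypothetical stand-ins for learned parameters.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep; dim is assumed to be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def inject_timestep(text_tokens, t, w_scale, w_shift):
    """Modulate text tokens with a timestep-dependent scale and shift.

    text_tokens: list of token vectors, each of length dim.
    w_scale, w_shift: hypothetical dim x dim weight matrices (lists of rows)
    standing in for learned projections of the timestep embedding.
    """
    dim = len(text_tokens[0])
    emb = timestep_embedding(t, dim)
    # Project the timestep embedding to a per-channel scale and shift.
    scale = [sum(w * e for w, e in zip(row, emb)) for row in w_scale]
    shift = [sum(w * e for w, e in zip(row, emb)) for row in w_shift]
    # Apply the same modulation to every token: x * (1 + scale) + shift.
    return [
        [x * (1.0 + s) + b for x, s, b in zip(tok, scale, shift)]
        for tok in text_tokens
    ]


if __name__ == "__main__":
    dim = 4
    tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
    identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    zeros = [[0.0] * dim for _ in range(dim)]
    early = inject_timestep(tokens, 900, identity, zeros)
    late = inject_timestep(tokens, 50, identity, zeros)
    print(early[0])
    print(late[0])
```

With zero weights the tokens pass through unchanged, so in this sketch the timestep only affects guidance through the projection weights. In a real model those weights would be learned jointly with the diffusion backbone.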

SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding

CV and Pattern Recognition

Edits pictures perfectly with just words.

17 Apr 2025 2

89%

An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

CV and Pattern Recognition

Lets you change pictures by talking to them.

24 Aug 2025 1

89%

Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

CV and Pattern Recognition

Teaches computers to see tiny differences in pictures.

8 Jun 2025 3

Page Count

11 pages
