Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal

Published: December 14, 2025 | arXiv ID: 2512.12875v1

By: Weihan Xu, Kan Jen Cheng, Koichi Saito and more

Potential Business Impact:

Edits sound and video together, perfectly matched.

Business Areas:

Motion Capture, Media and Entertainment, Video

Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task is challenging for two reasons: paired audio-visual data from before and after a targeted edit is scarce, and the two modalities are heterogeneous. To address these data and modeling challenges, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions that enables object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that SAVE removes the target objects from both the audio and the visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence than pairwise combinations of an audio editor and a video editor.
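As a rough illustration of what a "direct transport from source to target" can look like in a bridge-style flow-matching setup, the sketch below samples an intermediate state on a Brownian bridge pinned at a source and a target, and computes the time derivative of the bridge mean, which is a common regression target for a velocity network. This is a generic textbook construction, not SAVE's actual implementation: the function names, the noise scale `sigma`, and the random arrays standing in for audiovisual latents are all assumptions made for illustration.

```python
import numpy as np


def bridge_sample(x0, x1, t, sigma, rng):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Hypothetical helper: the mean interpolates linearly between source and
    target, and the noise vanishes at both endpoints.
    """
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x0.shape)


def velocity_target(x0, x1):
    """Time derivative of the bridge mean (1 - t) * x0 + t * x1."""
    return x1 - x0


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))  # stand-in for source (unedited) latents
x1 = rng.standard_normal((4, 8))  # stand-in for target (edited) latents

# One training-style sample: an intermediate state and its regression target.
x_mid = bridge_sample(x0, x1, t=0.5, sigma=0.1, rng=rng)
u_mid = velocity_target(x0, x1)
```

Unlike noise-to-data generation, both endpoints here are data, so a model trained this way starts from the source mixture rather than from Gaussian noise.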

Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits

Multimedia

Makes videos and sounds match perfectly.

8 Dec 2025 1

89%

Object-AVEdit: An Object-level Audio-Visual Editing Model

Multimedia

Changes sounds and pictures of objects in videos.

27 Sep 2025 2

89%

AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control

Multimedia

Changes video sounds using pictures and words.

26 Nov 2025 3

Page Count

18 pages
