EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
By: Bingxuan Li, Yiming Cui, Yicheng He, and more
Potential Business Impact:
Makes videos tell stories with better sound.
Sound effects build an essential layer of multimodal storytelling, shaping both the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A) generation, the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition of fine-grained controllable generation; and third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation of sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls such as sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building on this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
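To make the "when, what, and how" framing of sounding events concrete, the following is a minimal sketch of what such a symbolic representation could look like. The field names, classes, and the insert/edit operations are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of a "sounding event" representation (when / what / how).
# All names here are assumptions for illustration, not taken from the paper.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SoundingEvent:
    onset_s: float   # "when": event start time in seconds
    offset_s: float  # "when": event end time in seconds
    source: str      # "what": the sounding object or action, e.g. "glass shattering"
    manner: str      # "how": qualitative description, e.g. "sharp, echoing"


@dataclass
class EventTimeline:
    events: List[SoundingEvent] = field(default_factory=list)

    def insert(self, event: SoundingEvent) -> None:
        """Fine-grained control: insert a new event and keep the timeline ordered."""
        self.events.append(event)
        self.events.sort(key=lambda e: e.onset_s)

    def edit(self, index: int, **changes) -> None:
        """Fine-grained control: edit an existing event's when/what/how attributes."""
        for name, value in changes.items():
            setattr(self.events[index], name, value)


# Example: ground an instruction like "add a door slam at 3.2s" to the timeline.
timeline = EventTimeline()
timeline.insert(SoundingEvent(onset_s=3.2, offset_s=3.6,
                              source="door slam", manner="heavy, reverberant"))
timeline.edit(0, manner="muffled")
```

Under this kind of structure, generation, insertion, and editing all become operations on an explicit event timeline rather than edits to a free-form caption.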
Similar Papers
DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation
Sound
Makes videos talk with matching sounds.
FoleyBench: A Benchmark For Video-to-Audio Models
Sound
Makes videos create their own matching sounds.