Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond
By: Jessie Richter-Powell, Antonio Torralba, Jonathan Lorraine
Potential Business Impact:
Makes computers create sounds from your words.
We introduce Audio-SDS, a generalization of Score Distillation Sampling (SDS) to text-conditioned audio diffusion models. While SDS was initially designed for text-to-3D generation using image diffusion models, its core idea of distilling a powerful generative prior into a separate parametric representation extends to the audio domain. Leveraging a single pretrained model, Audio-SDS enables a broad range of tasks without requiring specialized datasets. In particular, we demonstrate how Audio-SDS can guide physically informed impact sound simulations, calibrate FM-synthesis parameters, and perform prompt-specified source separation. Our findings illustrate the versatility of distillation-based methods across modalities and establish a robust foundation for future work using generative priors in audio tasks.
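The core mechanism the abstract describes can be sketched in a few lines: SDS noises the rendered signal, asks a pretrained diffusion model to predict the noise, and uses the weighted residual as a gradient on the underlying parameters (dropping the U-Net Jacobian, as in the original SDS derivation). The sketch below is a toy illustration under stated assumptions: `eps_phi` is a hypothetical stand-in for a real text-conditioned audio diffusion model's noise predictor, the "renderer" is the identity map rather than an FM synthesizer or physics simulator, and `target` stands in for the prompt conditioning.

```python
import numpy as np


def alpha_bar(t):
    # Simple cosine-style noise schedule for t in (0, 1).
    return np.cos(0.5 * np.pi * t) ** 2


def eps_phi(x_t, t, target):
    # Hypothetical stand-in for a pretrained diffusion model's noise
    # predictor: it behaves as if `target` were the clean sample implied
    # by the text prompt. A real Audio-SDS setup would call a
    # text-conditioned audio diffusion model here instead.
    ab = alpha_bar(t)
    return (x_t - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)


def sds_grad(theta, render, target, t, rng):
    """One Score Distillation Sampling gradient step w.r.t. theta.

    render(theta) maps parameters (e.g. FM-synth settings) to audio x;
    the SDS gradient is w(t) * (eps_phi(x_t) - eps) * dx/dtheta, with
    the diffusion model's Jacobian omitted as in the SDS derivation.
    """
    x = render(theta)
    eps = rng.standard_normal(x.shape)
    ab = alpha_bar(t)
    x_t = np.sqrt(ab) * x + np.sqrt(1.0 - ab) * eps  # forward noising
    w = 1.0 - ab  # a common weighting choice
    residual = eps_phi(x_t, t, target) - eps
    # The renderer is the identity here, so dx/dtheta is the identity.
    return w * residual


# Toy setup: theta *is* the audio signal (identity renderer), and the
# "prompt" is a fixed sine wave the score model pulls toward.
rng = np.random.default_rng(0)
target = np.sin(np.linspace(0, 4 * np.pi, 64))
theta = np.zeros(64)
for step in range(500):
    g = sds_grad(theta, lambda th: th, target, t=0.5, rng=rng)
    theta -= 0.05 * g

# theta is distilled toward the signal favored by the generative prior.
print(float(np.mean((theta - target) ** 2)))
```

With a real audio diffusion model and a differentiable renderer, the same loop optimizes simulation or synthesizer parameters instead of raw samples, which is how the paper frames impact-sound calibration and FM-synthesis fitting.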
Similar Papers
Rethinking Score Distilling Sampling for 3D Editing and Generation
CV and Pattern Recognition
Makes 3D models from text, and changes them.
RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling
CV and Pattern Recognition
Makes 3D pictures follow your exact ideas.
Text-to-3D Generation using Jensen-Shannon Score Distillation
CV and Pattern Recognition
Creates better 3D pictures from words.