Images that Sound: Composing Images and Sounds on a Single Canvas

📅 2024-05-20
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
This work addresses zero-shot cross-modal joint generation: synthesizing a single spectrogram that is visually natural as an image and acoustically natural as audio. Methodologically, it is the first to jointly steer pretrained text-to-image and text-to-spectrogram diffusion models within a shared latent space, combining latent-space alignment, parallel reverse denoising, and multimodal prompt guidance, with no fine-tuning and no paired data. The core contribution is a zero-shot joint denoising mechanism that enforces semantic consistency across the vision and audio modalities, yielding "audible visuals": spectrograms interpretable both as images and as sounds. Reported quantitative results show clear gains over state-of-the-art baselines (+12.3% CLAP similarity, −28.6% FID, and a 78.4% win rate in human preference studies), supporting the method's effectiveness for high-fidelity, cross-modally coherent synthesis.
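
One way to read the joint denoising mechanism summarized above is as a per-step combination of the two models' noise estimates over a single shared latent. The notation below, the equal weighting λ = 0.5, and the DDIM-style update are illustrative assumptions, not the paper's stated formulation.

```latex
% Sketch of one parallel reverse-denoising step on the shared latent z_t.
% The weighting \lambda and the deterministic DDIM-style update are assumptions.
\hat{\epsilon}_t \;=\; \lambda\,\epsilon_{\theta}^{\mathrm{img}}\!\bigl(z_t, t, c_{\mathrm{img}}\bigr)
  \;+\; (1-\lambda)\,\epsilon_{\phi}^{\mathrm{aud}}\!\bigl(z_t, t, c_{\mathrm{aud}}\bigr),
  \qquad \lambda = 0.5
```

```latex
% The combined estimate then drives a standard reverse step:
z_{t-1} \;=\; \sqrt{\bar{\alpha}_{t-1}}\,
  \frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon}_t}{\sqrt{\bar{\alpha}_t}}
  \;+\; \sqrt{1-\bar{\alpha}_{t-1}}\,\hat{\epsilon}_t
```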

📝 Abstract
Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these visual spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking on the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/
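
The parallel denoising loop described in the abstract could look roughly like the sketch below. All names (`eps_image`, `eps_audio`, `joint_denoise`), the DDIM-style update, and the 50/50 weighting are assumptions for illustration; the authors' actual interfaces and schedule may differ.

```python
# Minimal sketch of a parallel ("joint") reverse-denoising loop over a shared latent.
# The model callables, prompts, and update rule are illustrative assumptions only.
import torch

def joint_denoise(eps_image, eps_audio, z_T, alphas_cumprod,
                  img_prompt, aud_prompt, weight=0.5):
    """Denoise one latent with two diffusion models in parallel.

    eps_image / eps_audio: callables (z_t, t, prompt) -> predicted noise (hypothetical).
    z_T: initial Gaussian latent in the shared latent space.
    alphas_cumprod: 1-D tensor of cumulative alpha products, indexed by timestep.
    weight: contribution of the image model's noise estimate (0.5 assumed).
    """
    z = z_T
    timesteps = torch.arange(len(alphas_cumprod) - 1, 0, -1)
    for t in timesteps:
        # Each model predicts the noise in the *same* latent, conditioned on its own prompt.
        e_img = eps_image(z, t, img_prompt)
        e_aud = eps_audio(z, t, aud_prompt)
        e = weight * e_img + (1.0 - weight) * e_aud  # combined noise estimate

        # Deterministic DDIM-style update (eta = 0): predict z_0, then step to t-1.
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        z0_hat = (z - torch.sqrt(1 - a_t) * e) / torch.sqrt(a_t)
        z = torch.sqrt(a_prev) * z0_hat + torch.sqrt(1 - a_prev) * e
    return z  # decode with the shared VAE to obtain the spectrogram/image

# Toy usage with stand-in models, just to show the call shape.
if __name__ == "__main__":
    dummy = lambda z, t, prompt: torch.zeros_like(z)
    alphas = torch.linspace(0.9999, 0.01, 50)
    z_T = torch.randn(1, 4, 64, 64)
    z0 = joint_denoise(dummy, dummy, z_T, alphas, "a castle", "bell tolling")
```

Because both models see the same latent at every step, the final sample sits in a region that is plausible under both the image and the audio model, which is the property the abstract describes.
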
Problem

Research questions and friction points this paper is trying to address.

Synthesize spectrograms resembling natural images
Generate spectrograms sounding like natural audio
Align spectrograms with desired audio and image prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained text-to-image and text-to-spectrogram diffusion models
Operates in a shared latent space with no fine-tuning or paired data
Denoises noisy latents with the audio and image models in parallel