Fig 1. Imagine a scene that originally contained a plane flying in the sky, from which the plane is later removed. A standard inpainting method will likely complete the region with a visually consistent texture, but the plane and the original semantic content of the scene will be lost. Clearly, without any additional information, the plane cannot be recovered. Here, for the first time, we tackle a multi-modal instance of the classical inpainting problem: given a missing image region and an audio sample, the inpainted region should comply with the semantic content of the sound as well as with the surrounding visual appearance.
We tackle audio-visual inpainting, the problem of completing an image so that it is consistent with the sound associated with the scene. To this end, we propose a multimodal, audio-visual inpainting method (AVIN), and show how to leverage sound to reconstruct semantically consistent images. AVIN is a two-stage algorithm: it first learns the scene semantics and reconstructs low-resolution images from a conditional probability distribution over pixels conditioned on audio, and then refines this result with a GAN-based network to increase the resolution of the reconstructed image. We show that AVIN is able to recover the original content, especially in the hard cases where the missing area heavily degrades the scene semantics: it can perform cross-modal generation whenever no visual context is observed at all, reconstructing visual data from sound alone.
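To make the first stage concrete, a PixelCNN-style model completes the image autoregressively; the factorization below is our own notation (not taken verbatim from the paper), with x_i the i-th pixel in raster-scan order and a the audio conditioning:

p(\mathbf{x} \mid \mathbf{a}) = \prod_{i=1}^{N} p\left(x_i \mid x_1, \ldots, x_{i-1}, \mathbf{a}\right)

Each pixel is thus sampled given the already generated pixels and the sound, which is what makes the Stage-I output stochastic rather than a single deterministic completion.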
Fig 2. Our AVIN model works in two stages. Stage I: AudioPixelCNN takes a low-resolution masked image as input and conditions it on the audio sample corresponding to the subject to be filled in the missing region. It outputs a low-resolution representation of the complete image, filling the missing region with contextual information. This output is stochastic, since pixels are sampled from a probability distribution, yielding a wide range of possible images. Stage II: an ADSR-GAN model refines the low-resolution image from Stage I. The generator takes the output of the previous stage, the high-resolution masked image, and the corresponding audio sample, and produces a high-resolution version of the final image. This image is then fed to a discriminator in a min-max learning scheme, so that the model better generates high-resolution images while preserving the visual semantics of the missing regions.
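The sketch below illustrates how the two stages are wired together. It is a minimal PyTorch illustration under our own assumptions: the module names (AudioPixelCNN, ADSRGenerator), tensor shapes, and layer choices are hypothetical, and the autoregressive sampling, discriminator, and adversarial training are omitted; it only shows how the masked images and the audio embedding flow through Stage I and Stage II.

# Minimal two-stage AVIN-style pipeline sketch (assumed shapes and modules,
# not the authors' released code).
import torch
import torch.nn as nn


class AudioPixelCNN(nn.Module):
    """Stage I (assumed): completes a low-resolution masked image,
    conditioned on an audio embedding."""

    def __init__(self, img_channels=3, audio_dim=128, hidden=64):
        super().__init__()
        # The audio embedding is broadcast over the spatial grid and
        # concatenated with the masked image before the convolutional stack.
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + hidden, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, img_channels, 3, padding=1),
        )

    def forward(self, lowres_masked, audio_emb):
        b, _, h, w = lowres_masked.shape
        a = self.audio_proj(audio_emb).view(b, -1, 1, 1).expand(b, -1, h, w)
        return torch.sigmoid(self.net(torch.cat([lowres_masked, a], dim=1)))


class ADSRGenerator(nn.Module):
    """Stage II (assumed): refines the Stage-I output into a high-resolution
    image, given the high-resolution masked image and the audio embedding."""

    def __init__(self, img_channels=3, audio_dim=128, hidden=64, scale=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear",
                              align_corners=False)
        self.net = nn.Sequential(
            nn.Conv2d(2 * img_channels + hidden, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, img_channels, 3, padding=1),
        )

    def forward(self, lowres_completed, highres_masked, audio_emb):
        b, _, h, w = highres_masked.shape
        up = self.up(lowres_completed)  # bring the Stage-I output to HR size
        a = self.audio_proj(audio_emb).view(b, -1, 1, 1).expand(b, -1, h, w)
        return torch.sigmoid(self.net(torch.cat([up, highres_masked, a], dim=1)))


if __name__ == "__main__":
    audio = torch.randn(1, 128)            # audio embedding (assumed size)
    lr_masked = torch.rand(1, 3, 32, 32)   # low-resolution masked image
    hr_masked = torch.rand(1, 3, 128, 128) # high-resolution masked image

    stage1 = AudioPixelCNN()
    stage2 = ADSRGenerator()
    lr_completed = stage1(lr_masked, audio)            # Stage I: coarse completion
    hr_out = stage2(lr_completed, hr_masked, audio)    # Stage II: refinement
    print(hr_out.shape)                                # torch.Size([1, 3, 128, 128])

In the real model, the Stage-II generator would be trained adversarially against a discriminator, and Stage I would sample pixels autoregressively rather than in a single convolutional pass.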
V. Sanguineti, S. Thakur, P. Morerio, A. Del Bue, V. Murino, "Audio-Visual Inpainting: Reconstructing Missing Visual Information with Sound," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023 (hosted on IEEE).
Acknowledgements |