Sanket Thakur

PhD candidate, Computer Vision. Gamer.

Music Separation

Music Gesture for Visual Sound Separation

20 August 2021

Authors: Chuang Gan et al.
Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Categories: sound-separation


Visual sound separation for different musical instruments.


Existing methods rely on appearance or motion (optical flow) cues, but do not explore the relationship between audio and body keypoints.


Keypoints are computed for three music video datasets [ Music-21, URMP, AtinPiano ]. The main motivation is to associate body dynamics with audio signals for sound separation. The model extracts global context features from video frames and merges them with the outputs of a context-aware graph CNN (GCNN) on human keypoints. Each input node to the GCNN is a 2D keypoint coordinate. For audio, mixture spectrograms are fed to an encoder-decoder network (U-Net), where the visual + GCNN features are fused with the encoder features through a self-guided attention mechanism. The U-Net generates a binary spectrogram mask for each individual source, which is compared with a ground-truth mask marking where that source is the dominant component of the input mixture.
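The ground-truth mask for the dominant component can be illustrated with a small sketch. This assumes the mask is 1 wherever the target source has the largest magnitude among all sources in the mixture (ties count as dominant here); the function name `dominant_binary_mask` and the toy spectrograms are hypothetical and not taken from the paper.

```python
def dominant_binary_mask(target_mag, other_mags):
    """Return a 0/1 mask that is 1 where `target_mag` is the largest magnitude.

    target_mag: F x T list of lists (magnitude spectrogram of one source).
    other_mags: list of F x T magnitude spectrograms of the remaining sources.
    Assumption: ties are counted as dominant for the target.
    """
    mask = []
    for f, row in enumerate(target_mag):
        mask_row = []
        for t, value in enumerate(row):
            dominant = all(value >= other[f][t] for other in other_mags)
            mask_row.append(1 if dominant else 0)
        mask.append(mask_row)
    return mask


# Toy magnitude spectrograms (2 frequency bins x 3 time frames) for two sources.
violin = [[0.9, 0.1, 0.4], [0.2, 0.7, 0.0]]
guitar = [[0.3, 0.5, 0.4], [0.6, 0.1, 0.2]]

violin_mask = dominant_binary_mask(violin, [guitar])
guitar_mask = dominant_binary_mask(guitar, [violin])
```

With real data the magnitudes would come from the STFT of each solo recording before mixing.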

visual features -> (2048, D_a x 1) D_a - dimensions
graph features -> (N x T x D_p)

v = visual network fusion -> N x T x (D_a + D_p) ~ V x D_v

a = audio features -> F x D_s
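To make the shapes above concrete, here is a rough sketch of the visual fusion step as a plain concatenation. It assumes one context vector of size D_a per frame, that this vector is repeated for each of the N keypoint nodes, and that V = N * T and D_v = D_a + D_p; these are my reading of the notation, not details confirmed by the paper. The helper `fuse_visual_and_keypoints` is hypothetical.

```python
def fuse_visual_and_keypoints(context, keypoint_feats):
    """Concatenate a per-frame context vector onto every keypoint feature.

    context:        T x D_a     (assumed: one global context vector per frame)
    keypoint_feats: N x T x D_p (one feature per keypoint node and frame)
    returns:        (N * T) x (D_a + D_p), i.e. V x D_v under the assumptions above
    """
    fused = []
    for node in keypoint_feats:           # iterate over the N nodes
        for t, feat in enumerate(node):   # iterate over the T frames
            fused.append(list(context[t]) + list(feat))
    return fused


# Toy sizes chosen only for illustration.
N, T, D_a, D_p = 3, 2, 4, 5
context = [[0.1 * (t + 1)] * D_a for t in range(T)]
keypoint_feats = [[[float(n + t)] * D_p for t in range(T)] for n in range(N)]

v = fuse_visual_and_keypoints(context, keypoint_feats)
```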

## Self-Guided Attention :

h_t = softmax(a v^T) v + a
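A minimal sketch of this kind of attention, assuming it takes the form softmax(a v^T) v + a with the audio features a (F x D) as queries and the visual features v (V x D) as keys and values. It further assumes D_s = D_v = D and leaves out any learned projections or scaling the paper may use. All helper names are hypothetical.

```python
import math


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def softmax_rows(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in x:
        m = max(row)
        exps = [math.exp(val - m) for val in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_guided_attention(a, v):
    """a: F x D audio features, v: V x D visual features; returns F x D."""
    weights = softmax_rows(matmul(a, transpose(v)))  # F x V attention weights
    attended = matmul(weights, v)                    # F x D visual summary per audio row
    # Residual connection back to the audio features.
    return [[h + r for h, r in zip(h_row, a_row)] for h_row, a_row in zip(attended, a)]


a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # F = 3, D = 2
v = [[1.0, 2.0], [3.0, 4.0]]              # V = 2, D = 2
weights = softmax_rows(matmul(a, transpose(v)))
h = self_guided_attention(a, v)
```

Each row of `weights` sums to 1, so every row of `h` is the corresponding audio feature plus a convex combination of the visual features.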

## Fused Features :

z = MLP(h_t) + h_t

The MLP is simply two fully connected layers. The fused features are then fed to the upsampling layers of the decoder part of the U-Net. To create a binary mask, a sigmoid operation is applied to the output feature map.
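The residual MLP and the sigmoid masking step can be sketched as follows. The ReLU between the two fully connected layers and the 0.5 threshold are assumptions on my part, and the decoder upsampling layers are not modelled: `decoder_logits` is just a toy stand-in for the decoder output. All names are hypothetical.

```python
import math


def linear(x, weight, bias):
    """Fully connected layer. x: rows x D_in, weight: D_in x D_out, bias: D_out."""
    return [[sum(xi * w for xi, w in zip(row, col)) + b
             for col, b in zip(zip(*weight), bias)] for row in x]


def relu(x):
    return [[max(0.0, val) for val in row] for row in x]


def fuse(h, w1, b1, w2, b2):
    """z = MLP(h) + h, with a two-layer MLP (ReLU in between is an assumption)."""
    mlp_out = linear(relu(linear(h, w1, b1)), w2, b2)
    return [[m + r for m, r in zip(m_row, h_row)] for m_row, h_row in zip(mlp_out, h)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_mask(logits, threshold=0.5):
    """Apply a sigmoid per pixel and threshold it (threshold value is an assumption)."""
    return [[1 if sigmoid(val) > threshold else 0 for val in row] for row in logits]


identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, -2.0]]
z = fuse(h, identity, [0.0, 0.0], identity, [0.5, 0.5])

decoder_logits = [[2.0, -1.0], [0.0, 0.3]]  # toy stand-in for the U-Net decoder output
mask = binary_mask(decoder_logits)
```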

The network is trained by minimising the per-pixel sigmoid cross-entropy loss.
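For reference, per-pixel sigmoid cross-entropy can be computed directly from the raw logits in a numerically stable way. The sketch below averages over all pixels; the reduction (mean versus sum) is an assumption, and the function name is hypothetical.

```python
import math


def sigmoid_cross_entropy(logits, targets):
    """Mean per-pixel binary cross-entropy computed from raw logits.

    logits:  F x T predicted mask logits (before the sigmoid).
    targets: F x T ground-truth binary mask (0 or 1).
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for x, y in zip(logit_row, target_row):
            # Stable form of -[y * log(s) + (1 - y) * log(1 - s)] with s = sigmoid(x).
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


target = [[1, 0], [0, 1]]
loss_uninformed = sigmoid_cross_entropy([[0.0, 0.0], [0.0, 0.0]], target)
loss_good = sigmoid_cross_entropy([[4.0, -4.0], [-4.0, 4.0]], target)
loss_bad = sigmoid_cross_entropy([[-4.0, 4.0], [4.0, -4.0]], target)
```

Logits of zero correspond to a predicted probability of 0.5 everywhere, which gives a loss of log 2 per pixel.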

TL;DR Reconstruct waveforms from spectrograms.
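As a rough illustration of the step before waveform reconstruction, the sketch below applies a predicted binary mask to a complex mixture spectrogram. It assumes the mask is multiplied element-wise with the mixture STFT, so the mixture phase is reused; the inverse STFT that would turn the result into a waveform is not shown. The function name and toy values are hypothetical.

```python
def apply_mask(mixture_stft, mask):
    """Element-wise product of a 0/1 mask with a complex F x T spectrogram."""
    return [[m * z for m, z in zip(mask_row, stft_row)]
            for mask_row, stft_row in zip(mask, mixture_stft)]


# Toy complex STFT of the mixture (2 frequency bins x 2 time frames).
mixture_stft = [[1 + 2j, 0.5 - 1j], [3 + 0j, -2 + 2j]]
mask = [[1, 0], [0, 1]]

separated_stft = apply_mask(mixture_stft, mask)
```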


A keypoint-only representation can perform better than RGB + keypoints.

