Sanket Thakur

PhD candidate, Computer Vision. Gamer.

3D vs 2D CNN

A Closer Look at Spatiotemporal Convolutions for Action Recognition

26 June 2021

Authors: Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri
URL: https://openaccess.thecvf.com/content_cvpr_2018/papers/Tran_A_Closer_Look_CVPR_2018_paper.pdf
Comments: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Categories: action-recognition

What

The paper compares several spatiotemporal CNN architectures for action recognition and proposes a new spatiotemporal convolutional block, R(2+1)D.

Why

2D CNNs applied to individual video frames give strong results even though they discard temporal information, which raises the question of how much explicit temporal modeling with 3D convolutions actually helps.

How

The authors build several 3D CNN variants (including mixed and reversed-mixed 2D/3D designs) alongside 2D CNNs of similar size, and compare their accuracy across different clip lengths.

They also argue that because a 3D CNN preserves temporal information across the clip length, it tends to perform better than a 2D CNN.
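To make the difference concrete, here is a minimal shape-arithmetic sketch in plain Python. It assumes a 2D network that stacks the frames of a clip into the channel axis, so its first convolution already collapses time, whereas a 3D convolution keeps a temporal axis in its output. The clip size (8 frames of 112x112 RGB), the 64 filters, the kernel sizes and the helper names are illustrative assumptions, not values taken from these notes.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

def conv2d_on_stacked_frames(channels, frames, height, width, filters=64):
    """2D conv over a clip whose frames are stacked as channels (hypothetical setup).

    The input is viewed as (channels * frames, height, width), so the
    output has no temporal axis left.
    """
    in_channels = channels * frames  # time is folded into the channel axis
    out_h = conv_out(height, kernel=7, stride=2, pad=3)
    out_w = conv_out(width, kernel=7, stride=2, pad=3)
    return in_channels, (filters, out_h, out_w)

def conv3d_on_clip(channels, frames, height, width, filters=64):
    """3D conv (3x7x7 kernel) over the same clip: the time axis survives."""
    out_t = conv_out(frames, kernel=3, stride=1, pad=1)
    out_h = conv_out(height, kernel=7, stride=2, pad=3)
    out_w = conv_out(width, kernel=7, stride=2, pad=3)
    return (filters, out_t, out_h, out_w)

in_ch, shape_2d = conv2d_on_stacked_frames(3, 8, 112, 112)
shape_3d = conv3d_on_clip(3, 8, 112, 112)
print("2D input channels:", in_ch, "output shape:", shape_2d)
print("3D output shape:", shape_3d)
```

Under these assumptions the 2D output is a 3-axis tensor with no time dimension, while the 3D output still carries 8 temporal positions that later layers can convolve over.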

They also show empirically that R(2+1)D performs best among the compared CNN models, which they attribute in part to its additional non-linearity (an extra ReLU between the 2D spatial convolution and the 1D temporal convolution).
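The factorization can be sketched with simple parameter counting. A (2+1)D block replaces one t x d x d 3D convolution with a 1 x d x d spatial convolution into M intermediate channels, a ReLU, and a t x 1 x 1 temporal convolution. The paper chooses M so that the parameter count roughly matches the full 3D convolution, which means the extra non-linearity comes at no extra parameter cost. The function names below are hypothetical and bias terms are ignored; treat this as a sketch, not the authors' code.

```python
def mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channel count M that keeps the (2+1)D parameter
    count at or just below that of a full t x d x d 3D convolution."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(n_in, n_out, t=3, d=3):
    """Weights in one full 3D convolution (no bias)."""
    return n_out * n_in * t * d * d

def params_2plus1d(n_in, n_out, t=3, d=3):
    """Weights in the factorized block: spatial conv, then temporal conv."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = m * n_in * d * d   # 1 x d x d convolution into m channels
    temporal = n_out * m * t     # t x 1 x 1 convolution into n_out channels
    return spatial + temporal

for n_in, n_out in [(64, 64), (64, 128)]:
    print(n_in, n_out,
          "M =", mid_channels(n_in, n_out),
          "3D:", params_3d(n_in, n_out),
          "(2+1)D:", params_2plus1d(n_in, n_out))
```

With one ReLU after each convolution, the factorized block applies two non-linearities where the 3D convolution applies one, for about the same number of weights.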

TL;DR Action recognition with a 34-layer R(2+1)D net

Notes

Training on longer clips yields better results for clip-level models, as the filters learn longer-term temporal patterns.

