13 Jun 2019 | Christopher Choy, JunYoung Gwak, Silvio Savarese
This paper proposes 4D spatio-temporal convolutional neural networks (Minkowski ConvNets) for processing 3D videos. The authors introduce a generalized sparse convolution that can handle high-dimensional data efficiently by using sparse tensors. They develop an open-source library for sparse tensor operations and demonstrate that their 4D networks outperform 2D and 3D methods in 3D semantic segmentation tasks. The networks use hybrid kernels and a 7D conditional random field to enforce spatio-temporal consistency. Experiments show that the 4D networks are more robust to noise and can be faster than 3D counterparts in some cases. The paper also presents various 3D and 4D datasets for evaluation and compares the performance of different network architectures. The authors conclude that their 4D ConvNets provide a more effective solution for spatio-temporal perception tasks.This paper proposes 4D spatio-temporal convolutional neural networks (Minkowski ConvNets) for processing 3D videos. The authors introduce a generalized sparse convolution that can handle high-dimensional data efficiently by using sparse tensors. They develop an open-source library for sparse tensor operations and demonstrate that their 4D networks outperform 2D and 3D methods in 3D semantic segmentation tasks. The networks use hybrid kernels and a 7D conditional random field to enforce spatio-temporal consistency. Experiments show that the 4D networks are more robust to noise and can be faster than 3D counterparts in some cases. The paper also presents various 3D and 4D datasets for evaluation and compares the performance of different network architectures. The authors conclude that their 4D ConvNets provide a more effective solution for spatio-temporal perception tasks.