13 Jun 2019 | Christopher Choy, JunYoung Gwak, Silvio Savarese
This paper introduces 4D spatio-temporal convolutional neural networks (CNNs) for processing 3D videos, i.e., sequences of depth images or LIDAR scans. The authors propose a generalized sparse convolution that handles high-dimensional data and implement it in an open-source auto-differentiation library for sparse tensors. Building on this, they construct 4D spatio-temporal CNNs and validate them on a range of 3D semantic segmentation benchmarks and datasets. To address the challenges of high-dimensional spaces, they introduce a hybrid kernel and a trilateral-stationary conditional random field (CRF) that enforces spatio-temporal consistency. Experimental results show that their 4D CNNs outperform 2D and 2D-3D hybrid methods, and are more robust to noise as well as faster and more efficient than 3D CNNs. The paper also includes a detailed analysis of the proposed methods and their performance across datasets.
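The core idea of the generalized sparse convolution is that features exist only at occupied coordinates, and the kernel is defined as an arbitrary set of integer offsets rather than a dense hypercube (which is what enables the hybrid kernel in high dimensions). The following is a minimal NumPy sketch of that idea, not the authors' implementation; the function name and the `offsets` argument (standing in for the generalized kernel shape) are illustrative choices.

```python
import numpy as np

def generalized_sparse_conv(coords, feats, weights, offsets):
    """Sketch of a generalized sparse convolution.

    coords:  (N, D) int array of occupied coordinates (D can be 4 for space-time)
    feats:   (N, C_in) features stored at those coordinates
    weights: (K, C_in, C_out) one weight matrix per kernel offset
    offsets: (K, D) integer offsets defining the (possibly non-cubic) kernel
    Output features are computed only at the occupied coordinates.
    """
    # Coordinate hash map: coordinate tuple -> row index in feats
    index = {tuple(c): i for i, c in enumerate(coords)}
    out = np.zeros((len(coords), weights.shape[2]))
    for k, off in enumerate(offsets):
        for i, c in enumerate(coords):
            # Kernel map: pair each output coordinate with the input
            # coordinate at this offset, if that input is occupied.
            j = index.get(tuple(c + off))
            if j is not None:
                out[i] += feats[j] @ weights[k]
    return out

# Tiny 2D example: two occupied points, a 2-offset kernel with unit weights.
coords = np.array([[0, 0], [1, 0]])
feats = np.array([[1.0], [2.0]])
offsets = np.array([[0, 0], [1, 0]])
weights = np.ones((2, 1, 1))
print(generalized_sparse_conv(coords, feats, weights, offsets))
```

A real implementation would build the kernel map once with a GPU hash table and batch the per-offset matrix multiplies, but the gather-multiply-scatter structure is the same.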