Invariant Scattering Convolution Networks
Joan Bruna and Stéphane Mallat
CMAP, Ecole Polytechnique, Palaiseau, France
Abstract—A wavelet scattering network computes a translation invariant image representation, which is stable to deformations and preserves high frequency information for classification. It cascades wavelet transform convolutions with non-linear modulus and averaging operators. The first network layer outputs SIFT-type descriptors whereas the next layers provide complementary invariant information which improves classification. The mathematical analysis of wavelet scattering networks explains important properties of deep convolution networks for classification.
A scattering representation of stationary processes incorporates higher-order moments and can thus discriminate textures having the same Fourier power spectrum. State-of-the-art classification results are obtained for handwritten digits and texture discrimination, with a Gaussian kernel SVM and a generative PCA classifier.
## 1 INTRODUCTION
A major difficulty of image classification comes from the considerable variability within image classes and the inability of Euclidean distances to measure image similarities. Part of this variability is due to rigid translations, rotations or scaling. This variability is often uninformative for classification and should thus be eliminated. In the framework of kernel classifiers [31], metrics are defined as a Euclidean distance applied to a representation $\Phi(x)$ of signals $x$. The operator $\Phi$ must therefore be invariant to these rigid transformations.
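As a concrete illustration of this kernel metric, the following sketch (an assumption for illustration, not code from the paper) defines a Gaussian kernel whose distance is the Euclidean distance between representations; the function name `gaussian_kernel` and the placeholder representation `phi` are hypothetical, and any invariant representation could be plugged in.

```python
import numpy as np

# Illustrative sketch (not from the paper): a Gaussian kernel whose metric is
# the Euclidean distance between representations Phi(x) and Phi(x').
# `phi` is a placeholder for any translation-invariant representation.
def gaussian_kernel(x, x_prime, phi, sigma=1.0):
    """K(x, x') = exp(-||Phi(x) - Phi(x')||^2 / (2 sigma^2))."""
    d = np.linalg.norm(phi(x) - phi(x_prime))
    return np.exp(-d ** 2 / (2 * sigma ** 2))
```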
Non-rigid deformations also induce important variability within object classes [3], [15], [34]. For instance, in handwritten digit recognition, one must take into account digit deformations due to different writing styles. However, full deformation invariance would reduce discrimination, since a digit can be deformed into a different digit, for example a one into a seven. The representation should therefore not be deformation invariant but should vary continuously with deformations, so that small deformations can be handled by a kernel classifier. A small deformation of an image $x$ into $x'$ should correspond to a small Euclidean distance $\|\Phi(x)-\Phi(x')\|$ in the representation space, as further explained in Section 2.
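A minimal numerical sketch of this stability requirement follows, assuming a toy representation (a Gaussian local average standing in for the scattering operator developed later) and a small smooth warp; the helper names `deform` and `phi` are illustrative only.

```python
import numpy as np
from scipy.ndimage import map_coordinates, gaussian_filter

def deform(x, tau_amplitude=0.5):
    """Warp an image by a small, smooth displacement field tau(u)."""
    n = x.shape[0]
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    tau = tau_amplitude * np.sin(2 * np.pi * u / n)   # smooth, small displacement
    return map_coordinates(x, [u - tau, v], order=1, mode="wrap")

def phi(x):
    # Toy stand-in for a stable representation: a Gaussian local average.
    return gaussian_filter(x, sigma=2.0)

x = np.random.rand(64, 64)
x_def = deform(x)
# A small deformation should yield a small relative distance in representation space.
print(np.linalg.norm(phi(x) - phi(x_def)) / np.linalg.norm(phi(x)))
```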
Translation invariant representations can be constructed with registration algorithms [32] or with the Fourier transform modulus. However, Section 2.1 explains why these invariants are not stable to deformations and hence not adapted to image classification. Avoiding Fourier transform instabilities suggests replacing sinusoidal waves by localized waveforms such as wavelets. However, wavelet transforms are not invariant to translations. Building invariant representations from wavelet coefficients requires introducing non-linear operators, which leads to a convolution network architecture.
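Indeed, translating $x$ only multiplies its Fourier transform by a unit-modulus phase, so the Fourier modulus is unchanged. The short NumPy check below (an illustrative sketch, not code from the paper) verifies this for a circular shift; the instability discussed in Section 2.1 concerns deformations rather than pure translations.

```python
import numpy as np

x = np.random.rand(64, 64)
x_shift = np.roll(x, shift=(5, 3), axis=(0, 1))   # translated (circularly shifted) image

# The Fourier transform modulus is identical for the image and its translation.
modulus = np.abs(np.fft.fft2(x))
modulus_shift = np.abs(np.fft.fft2(x_shift))
print(np.allclose(modulus, modulus_shift))        # True
```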
Deep convolution networks have the ability to build large-scale invariants which are stable to deformations [18]. They have been applied to a wide range of image classification tasks. Despite the remarkable successes of this neural network architecture, the properties and optimal configurations