16 Jan 2019 | Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, Serge Belongie
This paper addresses the challenge of long-tailed data distributions in large-scale, real-world datasets, where a few classes account for most of the data while the rest are underrepresented. Traditional remedies such as re-sampling and re-weighting by the number of observations per class fall short because the marginal benefit of each new data point diminishes as the sample size grows. The authors propose a theoretical framework that captures this data overlap by associating each sample with a small neighboring region, and define the effective number of samples as the expected volume covered by the samples, $E_n = (1 - \beta^n) / (1 - \beta)$, where $n$ is the number of samples and $\beta \in [0, 1)$ is a hyperparameter. Building on this, they design a re-weighting scheme that rebalances the loss by the inverse effective number of samples, i.e., a class with $n_y$ samples receives a weight proportional to $(1 - \beta) / (1 - \beta^{n_y})$, yielding a class-balanced loss.

Extensive experiments on artificially long-tailed CIFAR datasets and on large-scale datasets such as ImageNet and iNaturalist show that the proposed class-balanced loss significantly improves the performance of models trained on long-tailed data. The key contributions are a theoretical framework for quantifying the effective number of samples and a class-balanced loss term that enhances commonly used loss functions.
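To make the re-weighting scheme concrete, below is a minimal NumPy sketch (not the authors' code) that turns per-class sample counts into class-balanced weights via the effective-number formula. The function name `class_balanced_weights` is an illustrative assumption, as is the normalization of the weights to sum to the number of classes (a convention the paper adopts); `beta=0.999` is one value from the hyperparameter range the paper explores.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Compute per-class weights from the effective number of samples.

    For a class with n samples, the effective number is
    E_n = (1 - beta^n) / (1 - beta); the class weight is its inverse,
    normalized so the weights sum to the number of classes.
    """
    samples_per_class = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, samples_per_class)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes.
    weights = weights / weights.sum() * len(samples_per_class)
    return weights

# Example: a long-tailed split over 5 classes. Rare classes get larger
# weights, but less aggressively than plain inverse-frequency weighting.
print(class_balanced_weights([5000, 2000, 500, 100, 10], beta=0.999))
```

In practice these weights would multiply the per-sample loss terms, for instance via the `weight` argument of PyTorch's `nn.CrossEntropyLoss`. Note the role of $\beta$: as $\beta \to 0$ every class weight approaches 1 (no re-weighting), while as $\beta \to 1$ the scheme approaches inverse class-frequency weighting, so $\beta$ interpolates between the two extremes.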