Multisize Dataset Condensation

2024 | Yang He, Lingao Xiao, Joey Tianyi Zhou, Ivor W. Tsang
This paper proposes Multisize Dataset Condensation (MDC), a method that compresses multiple condensation processes into a single process to generate condensed datasets of multiple sizes. The key challenge addressed is the "subset degradation problem": a subset drawn from a condensed dataset is less representative than a dataset condensed directly to that size.

To mitigate this, MDC introduces an "adaptive subset loss" that dynamically selects the most learnable subset (MLS) during condensation. The MLS is determined through three components: feature distance calculation, feature distance comparison, and MLS freezing judgment. The feature distance measures the similarity between each subset and the full dataset, while the MLS freezing judgment ensures that previously learned information is preserved when the subset size changes.

Experiments on networks including ConvNet, ResNet, and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100, and ImageNet show that MDC achieves significant accuracy improvements, with 5.22%-6.40% gains on CIFAR-10 condensed to ten images per class. MDC also reduces storage requirements by reusing condensed images and is more efficient in both storage and computation than existing methods. The method is validated through extensive experiments and outperforms state-of-the-art condensation techniques, including DC, DSA, MTT, IDC, and DREAM. The results demonstrate that MDC effectively addresses the subset degradation problem and provides a flexible and efficient solution for dataset condensation.
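The MLS selection described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the use of mean-feature L2 distance as the metric, and the nested-prefix subset structure are all simplifying assumptions made here for illustration only.

```python
import numpy as np

def feature_distance(subset_feats, full_feats):
    # Hypothetical metric: L2 distance between the subset's mean feature
    # and the full dataset's mean feature (a stand-in for the paper's
    # feature distance calculation, whose exact form is not given here).
    return float(np.linalg.norm(subset_feats.mean(axis=0) - full_feats.mean(axis=0)))

def select_mls(features, subset_sizes):
    """Feature distance comparison: among candidate (assumed nested-prefix)
    subsets, pick the one whose features lie closest to the full dataset."""
    best_size, best_dist = None, float("inf")
    for n in subset_sizes:
        d = feature_distance(features[:n], features)
        if d < best_dist:
            best_size, best_dist = n, d
    return best_size, best_dist
```

A caller would recompute the MLS periodically during condensation and, per the freezing judgment, stop updating the previously selected subset's images once the MLS size changes, so earlier learned information is preserved.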