May 1, 2024 | Sungjune Park, Hyunjun Kim, Yong Man Ro
This paper addresses the challenge of robust pedestrian detection in real-world applications such as self-driving systems. The authors propose constructing a versatile pedestrian knowledge bank that contains generalized, task-compatible pedestrian representations. These representations are extracted from a large-scale pre-trained model (CLIP) and curated through vector quantization so that they remain distinguishable from background scenes. The knowledge bank is then used to enhance pedestrian features within different detection frameworks. Extensive experiments on four public datasets (CrowdHuman, WiderPedestrian, CityPersons, and Caltech) demonstrate the effectiveness and versatility of the proposed method, which achieves state-of-the-art performance. The method is shown to be adaptable to various detection frameworks, including region proposal-based and query-based detectors, and to perform well on data from diverse scenes. The paper also includes ablation studies and visualizations that support the findings and highlight the importance of the learnable representation hint in improving the quality of pedestrian features.
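
To make the vector quantization step more concrete, the following is a minimal sketch of generic codebook quantization, not the authors' implementation. It assumes a plain k-means style codebook, uses toy 2-D vectors in place of CLIP embeddings, and all function names (`fit_codebook`, `nearest_code`, `quantize`) are hypothetical names chosen for illustration. How the paper actually curates the bank and separates it from background features is not specified in this summary.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def fit_codebook(vectors, num_codes, iterations=10):
    """Fit a codebook with plain k-means (an assumed stand-in technique).

    The codebook is initialised from the first `num_codes` vectors so the
    result is deterministic.
    """
    codebook = [list(v) for v in vectors[:num_codes]]
    for _ in range(iterations):
        # Assign every vector to its nearest code.
        buckets = [[] for _ in range(num_codes)]
        for v in vectors:
            buckets[nearest_code(v, codebook)].append(v)
        # Move each code to the mean of the vectors assigned to it.
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old code if nothing was assigned to it
                dim = len(bucket[0])
                codebook[i] = [sum(v[d] for v in bucket) / len(bucket)
                               for d in range(dim)]
    return codebook


def quantize(vectors, codebook):
    """Replace each vector by its nearest codebook entry."""
    return [codebook[nearest_code(v, codebook)] for v in vectors]


# Toy stand-ins for embeddings: two well-separated groups of 2-D points.
features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
bank = fit_codebook(features, num_codes=2)
compressed = quantize(features, bank)
```

In this sketch `bank` plays the role of a small set of representative vectors, and `quantize` maps any feature to its closest representative. A detector could then consult such a bank to refine its own features, although the summary does not describe the enhancement mechanism in enough detail to sketch it here.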