24 Jun 2024 | Jinbae Im, JeongYeon Nam, Nokyung Park, Hyungmin Lee, Seunghyun Park
Scene Graph Generation (SGG) is a challenging task that involves detecting objects and predicting relationships between them. This paper proposes EGTR, a lightweight one-stage SGG model that leverages the multi-head self-attention layers of the DETR decoder to extract the relation graph. EGTR effectively utilizes the by-products of the object detector, eliminating the need for a separate triplet detector. The model extracts query and key representations from each self-attention layer and predicts relations between them using a shallow classifier. To address the dependency of the relation extraction task on object detection, a novel adaptive smoothing technique is introduced, which adjusts the relation label based on the quality of detected objects. Additionally, a connectivity prediction task is proposed as an auxiliary task to predict the existence of relationships between object pairs. Experiments on the Visual Genome and Open Image V6 datasets demonstrate the effectiveness and efficiency of EGTR, showing superior object detection performance and comparable triplet detection performance with fewer parameters and faster inference speed. The code for EGTR is publicly available at <https://github.com/naver-ai/egtr>.
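
The relation extraction step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released implementation. It assumes that the per-layer query and key representations are already available as arrays, that pairs are formed by concatenating the query of object i with the key of object j, that layers are fused by a plain average, and that the shallow classifier is a two-layer perceptron with sigmoid outputs. The function names (`pairwise_features`, `relation_graph`), the fusion rule, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_features(queries, keys):
    """Pair every query (subject role) with every key (object role).

    queries, keys: arrays of shape (N, d) from one self-attention layer.
    Returns an array of shape (N, N, 2d) where entry [i, j] is the
    concatenation of queries[i] and keys[j].
    """
    n, d = queries.shape
    subj = np.broadcast_to(queries[:, None, :], (n, n, d))
    obj = np.broadcast_to(keys[None, :, :], (n, n, d))
    return np.concatenate([subj, obj], axis=-1)


def relation_graph(layer_queries, layer_keys, w1, b1, w2, b2):
    """Score every (subject, object, predicate) triple.

    layer_queries, layer_keys: arrays of shape (L, N, d), one slice per layer.
    Returns relation scores of shape (N, N, R) with values in (0, 1).
    """
    feats = np.stack(
        [pairwise_features(q, k) for q, k in zip(layer_queries, layer_keys)]
    )                                           # (L, N, N, 2d)
    fused = feats.mean(axis=0)                  # assumption: average over layers
    hidden = np.maximum(fused @ w1 + b1, 0.0)   # ReLU hidden layer
    logits = hidden @ w2 + b2                   # (N, N, R)
    return 1.0 / (1.0 + np.exp(-logits))        # independent sigmoid per predicate


# Toy dimensions: 6 layers, 4 object queries, width 8, hidden 16, 5 predicates.
rng = np.random.default_rng(0)
L, N, d, H, R = 6, 4, 8, 16, 5
layer_queries = rng.normal(size=(L, N, d))
layer_keys = rng.normal(size=(L, N, d))
w1, b1 = rng.normal(size=(2 * d, H)) * 0.1, np.zeros(H)
w2, b2 = rng.normal(size=(H, R)) * 0.1, np.zeros(R)

scores = relation_graph(layer_queries, layer_keys, w1, b1, w2, b2)
print(scores.shape)  # (4, 4, 5)
```

Because the pairwise features reuse representations that the detector already computes, the only extra parameters in this sketch are those of the small classifier, which is the sense in which a separate triplet detector becomes unnecessary. An auxiliary connectivity head could be sketched the same way with a single output channel per object pair.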