24 Jun 2024 | Jinbae Im, JeongYeon Nam, Nokyung Park, Hyungmin Lee, Seunghyun Park
EGTR is a lightweight one-stage scene graph generation model that extracts the relation graph from the multi-head self-attention layers of the DETR decoder. Rather than attaching a separate triplet detector, it reuses the self-attention by-products: the attention queries and keys are treated as subject and object entities, and a shallow classifier predicts the relation between each pair.

To handle the dependency between object detection and relation extraction, EGTR introduces an adaptive smoothing technique that adjusts the relation labels according to the quality of the detected objects. This yields a continuous curriculum that begins with object detection and gradually shifts toward multi-task learning. A connectivity prediction task is also proposed as an auxiliary objective, predicting whether any relation exists between a given pair of objects.

Evaluated on the Visual Genome and Open Images V6 datasets, EGTR shows competitive performance in object detection and triplet detection with high efficiency: it achieves the best object detection performance and comparable triplet detection performance while using the fewest parameters and the fastest inference among the compared models. The contributions are the efficient generation of scene graphs from self-attention by-products, adaptive smoothing for effective multi-task learning, and the connectivity prediction task that aids relation extraction; experiments and ablation studies confirm the effectiveness and efficiency of the proposed method.
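As a rough illustration of how relations might be read off the decoder's self-attention by-products, the sketch below pairs each attention query (subject role) with each attention key (object role) and scores every pair with a shallow classifier. The module and tensor names are hypothetical and the shapes illustrative; this is not the authors' exact head.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Sketch of a shallow relation classifier over DETR decoder
    self-attention by-products (queries as subjects, keys as objects).
    Names and shapes are illustrative, not the paper's implementation."""

    def __init__(self, d_model: int, num_relations: int):
        super().__init__()
        # Pairwise features: concatenated (subject query, object key) vectors.
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, num_relations),
        )

    def forward(self, queries: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
        # queries, keys: [batch, num_queries, d_model] taken from one
        # decoder self-attention layer.
        B, N, D = queries.shape
        subj = queries.unsqueeze(2).expand(B, N, N, D)  # subject entity per pair
        obj = keys.unsqueeze(1).expand(B, N, N, D)      # object entity per pair
        pair = torch.cat([subj, obj], dim=-1)           # [B, N, N, 2*D]
        return self.classifier(pair)                    # relation logits per pair
```

Because the pairwise features come straight from representations the decoder already computes, the added head can stay shallow, which is what keeps the model one-stage and lightweight.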
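The adaptive smoothing idea can be sketched in a similar spirit: soften each positive relation target according to how well its subject and object were detected. The function below is a simplified stand-in rather than the paper's exact smoothing rule, and the per-query quality scores are assumed to lie in [0, 1].

```python
import torch

def adaptive_smooth_labels(rel_targets: torch.Tensor,
                           subj_quality: torch.Tensor,
                           obj_quality: torch.Tensor) -> torch.Tensor:
    """Toy sketch of adaptive relation-label smoothing (not the paper's formula).

    rel_targets:  [B, N, N, R] one-hot relation labels per (subject, object) pair
    subj_quality: [B, N] detection quality of each query in the subject role, in [0, 1]
    obj_quality:  [B, N] detection quality of each query in the object role, in [0, 1]
    """
    # Quality of a pair is taken as the product of its constituents' qualities.
    pair_quality = subj_quality.unsqueeze(2) * obj_quality.unsqueeze(1)  # [B, N, N]
    # Soften positive targets in proportion to how well the pair was detected.
    return rel_targets * pair_quality.unsqueeze(-1)
```

Under this sketch, poorly detected objects early in training push the smoothed targets toward zero, so the loss is dominated by object detection; as detection quality rises, the relation targets approach their full values, realizing the continuous curriculum described above.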