Playing for Data: Ground Truth from Computer Games

7 Aug 2016 | Stephan R. Richter, Vibhav Vineet, Stefan Roth, Vladlen Koltun
This paper presents a method for rapidly generating pixel-accurate semantic label maps for images extracted from modern computer games. The approach works without access to the game's source code or content: it intercepts the communication between the game and the graphics hardware and reconstructs associations between image patches, so that semantic labels can be propagated across images. The authors validate the method by producing dense pixel-level semantic annotations for 25,000 images from Grand Theft Auto V in only 49 hours, where conventional per-pixel annotation would have required an estimated 12 person-years. Propagating pixel-accurate labels across time and across object instances sharply reduces the average annotation time per image.

Intercepting the game's communication with the graphics hardware exposes resource information such as geometry, textures, and shaders. Hashing these resources yields object signatures that persist across scenes and gameplay sessions, so pixel-accurate object labels can be created without manually tracing boundaries, and a label assigned once propagates to other instances that share distinctive resources.

The resulting dataset is compatible with existing semantic segmentation datasets and significantly improves segmentation accuracy when used to supplement real-world images. Experiments on CamVid and KITTI demonstrate that models trained with the game data plus just one-third of the CamVid training set outperform models trained on the complete CamVid training set, reducing the need for expensive real-world labeling. The method is thus particularly effective for creating large-scale pixel-accurate ground truth for training semantic segmentation systems.
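The signature idea can be illustrated with a minimal Python sketch. The byte buffers, the MD5 hash, and the combination scheme below are assumptions for illustration, standing in for the resources the real system intercepts on the wire between the game and the GPU; the point is only that identical resources produce identical signatures, so a label assigned in one frame is reused wherever those resources are drawn again.

```python
import hashlib

def resource_hash(buffer: bytes) -> str:
    """Hash one raw resource buffer (e.g. mesh, texture, or shader bytecode)."""
    return hashlib.md5(buffer).hexdigest()

def object_signature(mesh: bytes, texture: bytes, shader: bytes) -> str:
    """Combine per-resource hashes into a single persistent object signature."""
    combined = "|".join(resource_hash(b) for b in (mesh, texture, shader))
    return hashlib.md5(combined.encode()).hexdigest()

# A label assigned once, in some earlier frame or gameplay session...
labels = {object_signature(b"mesh-a", b"tex-a", b"shader-a"): "car"}

# ...is recovered automatically in any later frame drawing the same resources.
sig = object_signature(b"mesh-a", b"tex-a", b"shader-a")
print(labels.get(sig, "unlabeled"))  # prints "car"
```

Because the signature depends only on the resource contents, it survives camera motion, scene changes, and restarts of the game, which is what makes label propagation across time and instances possible.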
The approach yields a diverse set of images from a realistic open-world game with highly variable content and layout, and it applies to many other dense prediction problems, including optical flow, scene flow, depth estimation, boundary detection, and stereo reconstruction. The paper concludes that modern game worlds can play a significant role in training artificial vision systems.