| Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox
This paper introduces a large-scale, hierarchical, multi-view RGB-D object dataset collected using an RGB-D camera. The dataset contains RGB and depth video sequences of 300 common everyday objects, recorded from multiple viewing angles for a total of 250,000 RGB-D images. The objects are organized into 51 categories arranged in a hierarchy derived from WordNet hypernym/hyponym relations. The dataset is publicly available to the research community at http://www.cs.washington.edu/rgbd-dataset.
The RGB-D dataset was collected using a prototype RGB-D camera manufactured by PrimeSense together with a FireWire camera from Point Grey Research. The RGB-D camera simultaneously records color and depth images at 640×480 resolution. The dataset includes 8 video sequences of natural scenes covering common indoor environments. The RGB-D Object Dataset also includes video sequences of objects recorded on a turntable, with each object spun at constant speed. The cameras are placed about one meter from the turntable, and data was recorded at three different heights relative to the turntable. Each video sequence is recorded at 20 Hz and contains around 250 frames, giving a total of 250,000 RGB + Depth frames in the RGB-D Object Dataset.
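As a rough sanity check on the quoted total, the per-object numbers multiply out as follows (the exact per-sequence frame count varies around 250, so this is approximate, not the paper's own accounting):

```python
objects = 300          # everyday objects in the dataset
heights = 3            # camera heights per object on the turntable
frames_per_seq = 250   # approximate frames per video sequence

total = objects * heights * frames_per_seq
print(total)  # on the order of the quoted 250,000 RGB + Depth frames
```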
The dataset includes segmentation masks and video annotation software. Segmentation is performed using visual cues, depth cues, and rough knowledge of the configuration between the turntable and camera. The segmentation process first removes most of the background by keeping only the points within a 3D bounding box where the turntable and object are expected to be. This prunes most pixels that are far in the background, leaving only the turntable and the object. Because the object lies above the turntable surface, RANSAC plane fitting is then used to find the table plane, and the points lying above it are taken to be the object. This procedure gives very good segmentation for many objects in the dataset, but remains problematic for small, dark, transparent, and reflective objects.
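The plane-fitting step can be sketched in a few lines of NumPy. This is a minimal illustration of the general RANSAC-plus-height-threshold idea, not the paper's implementation; function names and thresholds are invented for the example, and the bounding-box crop is assumed to have already happened:

```python
import numpy as np

def ransac_plane(points, n_iters=200, threshold=0.01, seed=None):
    """Fit a dominant plane to an (N, 3) point cloud with RANSAC.

    Returns (normal, d) with unit-length normal, so the plane is
    normal . p + d = 0.
    """
    rng = np.random.default_rng(seed)
    best_inliers, best_model = 0, None
    for _ in range(n_iters):
        # Sample 3 distinct points and form the plane through them.
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:            # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal.dot(p0)
        # Count points within `threshold` of the candidate plane.
        inliers = int((np.abs(points @ normal + d) < threshold).sum())
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (normal, d)
    return best_model

def segment_object(points, threshold=0.01):
    """Keep the points lying above the dominant (table) plane."""
    normal, d = ransac_plane(points, threshold=threshold)
    # Orient the normal upward so "above the plane" is well defined.
    if normal[2] < 0:
        normal, d = -normal, -d
    height = points @ normal + d
    return points[height > threshold]
```

With a point cloud cropped to the turntable region, `segment_object` returns the off-plane points, i.e. the object; in practice one would also cluster those points to drop stray outliers.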
The RGB-D Object Dataset is used for object recognition and detection. The paper evaluates the performance of three state-of-the-art classifiers: linear support vector machine (LinSVM), Gaussian kernel support vector machine (kSVM), and random forest (RF). The results show that combining shape and visual features gives higher overall category-level performance regardless of classification technique. The paper also demonstrates the use of the RGB-D Object Dataset for object detection in real-world scenes, where the task is to identify and localize all objects of interest. The detection approach is evaluated on the 8 natural scene video sequences described in Section IV. The results show that depth features (HOG over the depth image and normalized depth histograms) are much better than HOG over the RGB image, and the best performance is attained by combining image and depth features.
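The category-level comparison can be sketched with off-the-shelf classifiers. The sketch below uses scikit-learn stand-ins for the three classifiers and random, class-separable vectors in place of the paper's actual shape and visual descriptors; it only illustrates the protocol of concatenating the two feature types and training each classifier on the result:

```python
import numpy as np
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, d_vis, d_shape = 200, 16, 8
y = rng.integers(0, 2, size=n)               # two toy categories
# Stand-in features: class-dependent means make them separable.
visual = rng.normal(y[:, None] * 1.0, 1.0, size=(n, d_vis))
shape = rng.normal(y[:, None] * 1.0, 1.0, size=(n, d_shape))
combined = np.hstack([visual, shape])        # concatenate visual + shape

classifiers = {
    "LinSVM": LinearSVC(),
    "kSVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(combined[:150], y[:150])
    print(f"{name}: {clf.score(combined[150:], y[150:]):.2f}")
```

On the real dataset the same loop would run over extracted spin-image and SIFT-style descriptors rather than random vectors, with category-level splits leaving whole object instances out of training.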