This paper addresses the challenge of estimating surface normals from a single RGB image, a task that is not affected by scale ambiguity and has a compact output space. The authors propose a method that leverages per-pixel ray direction and encodes the relationship between neighboring surface normals by learning their relative rotation. This approach allows for more accurate and piecewise smooth predictions, even for challenging in-the-wild images with arbitrary resolution and aspect ratios. Compared to a recent state-of-the-art model, the proposed method shows stronger generalization ability despite being trained on a much smaller dataset. The key contributions include:
1. **Per-Pixel Ray Direction**: The method uses dense pixel-wise ray direction as input to the network, enabling camera intrinsics-aware inference and improving generalization.
2. **Ray Direction-Based Activation**: A new activation function is introduced to ensure that the predicted normal is visible, i.e., the angle between the ray direction and the normal is greater than 90°.
3. **Rotation Estimation**: Surface normal estimation is recast as rotation estimation: the relative rotation between the normals of neighboring pixels is predicted in an axis-angle representation. This yields piecewise smooth predictions that remain crisp at object boundaries.
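The per-pixel ray direction input (contribution 1) can be sketched as follows: given pinhole intrinsics, each pixel is back-projected to a unit direction in camera space. This is an illustrative reconstruction under a standard pinhole model, including an assumed half-pixel centre offset, not the authors' exact implementation.

```python
import numpy as np

def pixel_ray_directions(H, W, fx, fy, cx, cy):
    """Unit ray direction for every pixel of an H x W image.

    Sketch of the camera-intrinsics-aware input map described in the
    summary; the half-pixel offset is an assumption of this sketch.
    Returns an (H, W, 3) array of unit vectors.
    """
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # per-pixel coordinates
    # Back-project pixel centres onto the z = 1 plane in camera space.
    rays = np.stack(
        [(u + 0.5 - cx) / fx,
         (v + 0.5 - cy) / fy,
         np.ones_like(u, dtype=float)],
        axis=-1,
    )
    # Normalize each ray to unit length.
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)
```

Feeding this map alongside the RGB image is what makes inference aware of the camera intrinsics, so the same network can handle arbitrary resolutions and aspect ratios.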
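The visibility constraint behind contribution 2 (the angle between the ray and the normal must exceed 90°, i.e. their dot product must be negative) can be illustrated with a minimal sketch: decompose the raw prediction along the ray and push the along-ray component past the 90° boundary. This is a hypothetical construction for exposition; the paper's actual activation function may differ.

```python
import numpy as np

def visibility_activation(n_raw, ray, eps=1e-2):
    """Map an unconstrained 3-vector to a unit normal facing the camera.

    Illustrative sketch (not the paper's exact activation): if the
    normal's component along the viewing ray is not sufficiently
    negative, subtract enough of the ray direction to make it so,
    then renormalize. The margin `eps` is an assumption.
    """
    ray = ray / np.linalg.norm(ray)
    n = n_raw / (np.linalg.norm(n_raw) + 1e-12)
    along = np.dot(n, ray)  # cos of the angle between normal and ray
    if along > -eps:  # normal would face away from the camera
        n = n - (along + eps) * ray  # force the dot product to -eps
        n = n / np.linalg.norm(n)
    return n
```

After this mapping, every output satisfies the visibility condition by construction, so the network never wastes capacity on physically impossible normals.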
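The axis-angle idea in contribution 3 amounts to expressing one pixel's normal as a rotation of its neighbor's. A rotation given by a unit axis and an angle can be applied with Rodrigues' formula, sketched below; how the network parameterizes and predicts these rotations is not specified here, so the surrounding details are assumptions.

```python
import numpy as np

def rotate_axis_angle(n, axis, angle):
    """Rotate vector n about unit `axis` by `angle` radians.

    Rodrigues' rotation formula; shown to illustrate how a neighboring
    pixel's normal can be obtained from the current one via a predicted
    relative rotation (an exposition aid, not the authors' exact code).
    """
    axis = axis / np.linalg.norm(axis)
    return (n * np.cos(angle)
            + np.cross(axis, n) * np.sin(angle)
            + axis * np.dot(axis, n) * (1.0 - np.cos(angle)))
```

Predicting relative rotations rather than absolute normals encourages near-identity rotations on smooth surfaces and large rotations only across object boundaries, which is what produces piecewise smooth yet crisp outputs.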
The proposed method is evaluated on several datasets, including indoor scenes, dynamic outdoor scenes, and in-the-wild images, demonstrating superior performance in terms of accuracy and detail. The code for the method is available at <https://github.com/baegwangbin/DSINE>.