This paper addresses 3D human pose estimation from a single color image, a task traditionally approached in two steps: 2D joint localization followed by 3D pose optimization. The authors instead propose an end-to-end learning approach that predicts the 3D pose directly from the image. They make two key contributions. First, they finely discretize the 3D space around the subject and train a Convolutional Network (ConvNet) to predict per-voxel likelihoods for each joint, which improves performance over direct regression of joint coordinates. Second, they employ a coarse-to-fine prediction scheme to handle the high dimensionality of the volumetric representation, which enables iterative refinement and repeated processing of the image features. The proposed approach outperforms state-of-the-art methods on standard benchmarks, achieving a relative error reduction of over 30% on average. Additionally, the authors investigate the practical use of their volumetric representation in a decoupled architecture and demonstrate its effectiveness even when end-to-end training is not feasible, as is the case for in-the-wild images.
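
To make the volumetric representation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It builds a per-voxel target for one joint, reads a joint location back out as the most likely voxel, and produces a sequence of targets whose depth resolution grows step by step, mirroring the coarse-to-fine idea. The grid size, the Gaussian-shaped target, the value of `sigma`, the depth schedule, and all function names are assumptions chosen for illustration; the summary above does not specify them.

```python
import numpy as np

def gaussian_volume(joint_vox, shape=(16, 16, 16), sigma=1.0):
    """Per-voxel target for one joint: a 3D Gaussian centred on its voxel.

    joint_vox is a (z, y, x) voxel index. The Gaussian shape and sigma are
    illustrative assumptions, not values taken from the summarized paper.
    """
    zz, yy, xx = np.meshgrid(*[np.arange(n) for n in shape], indexing="ij")
    z, y, x = joint_vox
    d2 = (zz - z) ** 2 + (yy - y) ** 2 + (xx - x) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def decode_joint(volume):
    """Return the (z, y, x) index of the most likely voxel."""
    idx = np.unravel_index(np.argmax(volume), volume.shape)
    return tuple(int(i) for i in idx)

def coarse_to_fine_targets(joint_unit, xy_res=16, depth_schedule=(1, 2, 4, 8, 16)):
    """Build one target volume per stage, with increasing depth resolution.

    joint_unit is a (z, y, x) position normalized to [0, 1). The schedule
    is a hypothetical example of refining the depth dimension gradually.
    """
    z, y, x = joint_unit
    targets = []
    for depth in depth_schedule:
        vox = (int(z * depth), int(y * xy_res), int(x * xy_res))
        targets.append(gaussian_volume(vox, shape=(depth, xy_res, xy_res)))
    return targets
```

In this sketch, a network trained against such targets would output one volume per joint at each stage, and `decode_joint` would convert the final volume back into a discrete 3D location. Direct coordinate regression, by contrast, would predict the three numbers in `joint_unit` without any intermediate volume.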