This paper proposes the Harmonious Attention Convolutional Neural Network (HA-CNN), a lightweight CNN architecture for person re-identification (re-id) that jointly learns attention selection and feature representation. Existing re-id methods either assume well-aligned bounding boxes or rely on constrained attention mechanisms; both are sub-optimal when images are misaligned because of large pose variations and detection errors. HA-CNN addresses these limitations by learning soft pixel attention and hard regional attention together with the feature representations, so that their complementary information is maximized under re-id discriminative constraints.

HA-CNN uses a multi-branch architecture: a global branch learns global features, and local branches learn features of local regions. A harmonious attention module combines soft spatial-channel attention with hard regional attention, and a cross-attention interaction learning scheme enhances the compatibility between attention selection and feature representation. The model is lightweight, with only 2.7 million parameters, which makes it efficient to train and deploy and suitable for real-world applications. It is evaluated on three large-scale benchmarks: CUHK03, Market-1501, and DukeMTMC-ReID. HA-CNN achieves high accuracy on both manually labeled and auto-detected images and outperforms state-of-the-art methods in Rank-1 accuracy and mAP, which shows the effectiveness of joint attention learning for re-id. The paper also analyzes the model's components in detail and compares HA-CNN with other popular CNN architectures.
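
To make the two kinds of attention named above more concrete, the following toy sketch shows one plausible reading of them in plain Python. It is an illustration under stated assumptions, not the HA-CNN implementation: the function names, the way spatial and channel scores are combined (a product followed by a sigmoid), and the hand-picked region position are all hypothetical. In a trained model the attention scores and region positions would be predicted by learned layers.

```python
import math


def sigmoid(x):
    """Logistic function, maps any real score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_attention(spatial_logits, channel_logits):
    """Build a C x H x W soft attention map from an H x W spatial score map
    and a length-C channel score vector (assumed combination: product, then sigmoid)."""
    return [
        [[sigmoid(s * c) for s in row] for row in spatial_logits]
        for c in channel_logits
    ]


def apply_soft_attention(features, attention):
    """Re-weight every element of a C x H x W feature map by its attention value."""
    return [
        [[f * a for f, a in zip(f_row, a_row)] for f_row, a_row in zip(f_ch, a_ch)]
        for f_ch, a_ch in zip(features, attention)
    ]


def crop_region(features, top, left, height, width):
    """Hard regional attention as a plain crop: keep one rectangular region
    of every channel and discard everything else."""
    return [
        [row[left:left + width] for row in channel[top:top + height]]
        for channel in features
    ]


# Toy 2-channel, 4 x 4 feature map: channel 0 is all 1.0, channel 1 is all 2.0.
features = [[[float(c + 1)] * 4 for _ in range(4)] for c in range(2)]

# Hypothetical scores; zero spatial logits give an attention value of 0.5 everywhere.
spatial_logits = [[0.0] * 4 for _ in range(4)]
channel_logits = [1.0, 2.0]

attention = soft_attention(spatial_logits, channel_logits)
attended = apply_soft_attention(features, attention)

# Hypothetical region: rows 1-2, full width (for example, a torso-like band).
region = crop_region(features, top=1, left=0, height=2, width=4)
```

The contrast the sketch is meant to show is that soft attention keeps the full feature map and re-weights each element, while hard attention selects a discrete region and drops the rest, which is why the two can carry complementary information.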
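
For readers unfamiliar with the two reported metrics, here is a minimal sketch of how Rank-1 accuracy and mAP can be computed from ranked retrieval results. It assumes that, for each query, the gallery has already been sorted by decreasing similarity and reduced to a list of booleans marking correct identities; the function names and the toy data are hypothetical, and the sketch omits any benchmark-specific evaluation protocol details.

```python
def average_precision(ranked_matches):
    """Average precision for one query.

    ranked_matches is a list of booleans, ordered from the most to the least
    similar gallery image; True marks a gallery image of the query identity.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
    return precision_sum / hits if hits else 0.0


def evaluate(all_ranked_matches):
    """Return (Rank-1 accuracy, mAP) over a list of per-query ranked match lists."""
    num_queries = len(all_ranked_matches)
    # Rank-1: fraction of queries whose top-ranked gallery image is correct.
    rank1 = sum(1 for m in all_ranked_matches if m and m[0]) / num_queries
    # mAP: mean of the per-query average precision values.
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / num_queries
    return rank1, mean_ap


# Toy example with two queries and a three-image gallery.
toy_results = [
    [True, False, True],   # correct matches at ranks 1 and 3
    [False, True, False],  # single correct match at rank 2
]
rank1, mean_ap = evaluate(toy_results)
```

Rank-1 only rewards getting the first result right, while mAP also reflects how highly every correct gallery image is ranked, so the two metrics are usually reported together.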