This paper proposes a unified CNN-RNN framework for multi-label image classification. The framework combines a convolutional neural network (CNN) with a recurrent neural network (RNN) to learn a joint image-label embedding that characterizes both semantic label dependencies and image-label relevance, and the whole model is trained end-to-end.

The CNN extracts semantic features from the image, while an LSTM-based RNN models high-order label co-occurrence dependencies in the joint embedding space, allowing the model to learn both semantic redundancy among labels and their co-occurrence structure. Because the recurrent decoder conditions each prediction on the labels already emitted, it also acts as an implicit attention mechanism that adapts the image representation from one predicted label to the next.
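To make the architecture concrete, the following is a minimal sketch of such a CNN-RNN decoder; the layer sizes, the START token, and the single projection into the joint space are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of a CNN-RNN multi-label decoder: CNN features and LSTM outputs
# are projected into a shared embedding space, and each label is scored by
# its inner product with the joint embedding. Dimensions are assumptions.
import torch
import torch.nn as nn

class CNNRNN(nn.Module):
    def __init__(self, num_labels, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Label embeddings; the extra row is a hypothetical START token.
        self.label_embed = nn.Embedding(num_labels + 1, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # image -> joint space
        self.out_proj = nn.Linear(hidden_dim, embed_dim)  # RNN state -> joint space

    def forward(self, img_feats, prev_labels):
        # img_feats: (B, feat_dim) CNN features; prev_labels: (B, T) indices
        # of the labels predicted so far (teacher-forced during training).
        out, _ = self.rnn(self.label_embed(prev_labels))             # (B, T, hidden)
        joint = torch.tanh(self.out_proj(out)
                           + self.img_proj(img_feats).unsqueeze(1))  # (B, T, embed)
        # Score every real label against the joint embedding.
        return joint @ self.label_embed.weight[:-1].t()              # (B, T, num_labels)
```

At inference time, labels would be emitted greedily or by beam search, feeding each predicted label back into the LSTM until a stop condition is reached.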
Experimental results on public benchmark datasets (NUS-WIDE, Microsoft COCO, and PASCAL VOC 2007) show that the proposed method significantly outperforms state-of-the-art multi-label classification methods, measured by per-class and overall precision, recall, and F1 scores. Performance is strongest on large objects and on labels with strong co-occurrence dependencies, whereas small objects remain difficult because global image features have limited discriminative ability for them.

Attention visualizations show that, even without an explicit attention model, the network steers its attention to different image regions when predicting different labels, much as humans do in multi-label classification. Overall, the proposed framework is a unified approach that combines the advantages of joint image-label embedding and label co-occurrence modeling.
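For reference, the per-class and overall precision/recall/F1 metrics mentioned above can be computed as below; keeping the top-k predicted labels per image is the usual protocol, but the exact k and this NumPy implementation are assumptions here.

```python
# Per-class (C-P/C-R/C-F1) and overall (O-P/O-R/O-F1) multi-label metrics.
import numpy as np

def multilabel_prf(scores, gt, k=3):
    # scores: (N, L) real-valued label scores; gt: (N, L) binary ground truth.
    pred = np.zeros_like(gt)
    topk = np.argsort(-scores, axis=1)[:, :k]   # indices of top-k labels
    np.put_along_axis(pred, topk, 1, axis=1)    # binarize the predictions

    tp = (pred * gt).sum(axis=0).astype(float)  # true positives per label
    n_pred = pred.sum(axis=0)                   # predictions per label
    n_gt = gt.sum(axis=0)                       # ground truths per label

    # Per-class: average precision and recall over labels, then combine.
    cp = float(np.mean(tp / np.maximum(n_pred, 1)))
    cr = float(np.mean(tp / np.maximum(n_gt, 1)))
    cf1 = 2 * cp * cr / max(cp + cr, 1e-12)

    # Overall: pool counts across all labels before dividing.
    op = tp.sum() / max(n_pred.sum(), 1)
    orec = tp.sum() / max(n_gt.sum(), 1)
    of1 = 2 * op * orec / max(op + orec, 1e-12)
    return cp, cr, cf1, op, orec, of1
```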