A hybrid convolution-transformer model is proposed for hyperspectral image classification to address the challenges of limited labeled data and imbalanced classes. The model combines a residual 3D convolutional neural network (3D-CNN) with a vision transformer to extract joint spectral-spatial features from small training samples: the 3D-CNN extracts local features, while the vision transformer captures global information. A residual channel attention module preserves important spectral information, and a sequence aggregation layer mitigates overfitting. Evaluated on three benchmark datasets (Xuzhou, Salinas, and KSC), the model achieves overall accuracies (OA) of 99.75%, 99.46%, and 99.95% using 5%, 5%, and 10% labeled samples, respectively, outperforming other state-of-the-art methods in classification accuracy. The hybrid approach effectively leverages both local and global information, enhancing the model's ability to capture spectral-spatial features, and it maintains high accuracy across classes in small-sample scenarios. The study highlights the value of integrating convolutional and transformer architectures for hyperspectral image classification, particularly when labeled data are scarce; the proposed model provides a robust solution by combining the strengths of the 3D-CNN and the vision transformer.
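To make the described pipeline concrete, the following is a minimal PyTorch sketch of a residual 3D-CNN feeding a small vision-transformer encoder, with a squeeze-and-excitation-style residual channel attention block and mean-pooling sequence aggregation. All layer sizes, module names, and the patch/band settings (e.g., `HybridCNNTransformer`, 9x9 patches, 30 bands) are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a hybrid residual 3D-CNN + vision-transformer classifier.
# Hyperparameters and module structure are assumptions for illustration only.
import torch
import torch.nn as nn


class ResidualChannelAttention(nn.Module):
    """Channel attention with a residual connection (assumed form of the
    paper's residual channel attention module)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x + x * w  # residual: re-weighted features added back


class HybridCNNTransformer(nn.Module):
    def __init__(self, bands: int = 30, patch: int = 9, n_classes: int = 16,
                 embed_dim: int = 64, depth: int = 2, heads: int = 4):
        super().__init__()
        # 3D-CNN stem: local spectral-spatial feature extraction.
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.BatchNorm3d(8), nn.ReLU(inplace=True),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3), padding=(2, 1, 1)),
            nn.BatchNorm3d(16), nn.ReLU(inplace=True),
        )
        self.attn = ResidualChannelAttention(16)
        # One token per spatial position; spectral axis folded into the embedding.
        self.proj = nn.Linear(16 * bands, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, patch * patch, embed_dim))
        # Lightweight transformer encoder: global context over the patch.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           dim_feedforward=embed_dim * 2,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):                       # x: (B, 1, bands, patch, patch)
        f = self.attn(self.cnn(x))              # (B, 16, bands, patch, patch)
        b, c, d, h, w = f.shape
        tokens = f.permute(0, 3, 4, 1, 2).reshape(b, h * w, c * d)
        tokens = self.encoder(self.proj(tokens) + self.pos)
        # Sequence aggregation by averaging tokens instead of a class token,
        # a simple way to reduce overfitting on small training sets.
        return self.head(tokens.mean(dim=1))


# Example: classify a batch of two 9x9 patches with 30 spectral bands.
model = HybridCNNTransformer(bands=30, patch=9, n_classes=16)
logits = model(torch.randn(2, 1, 30, 9, 9))
print(logits.shape)  # torch.Size([2, 16])
```

Averaging the token sequence rather than adding a learnable class token is one plausible reading of the "sequence aggregation layer"; the exact aggregation used in the paper may differ.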