The paper presents a hybrid convolution-transformer framework for hyperspectral image classification that addresses the challenges of limited labeled data and imbalanced classes. The proposed method combines a residual 3D convolutional neural network (3D-CNN) with a vision transformer and incorporates a sequence aggregation layer to mitigate overfitting. A residual channel attention module captures rich, complementary spatial-spectral information while preserving spectral detail during feature extraction. Experiments on three benchmark datasets (Xuzhou, Salinas, and KSC) show that the model achieves overall accuracies of 99.75%, 99.46%, and 99.95% using only 5%, 5%, and 10% of the labeled samples for training, respectively. It outperforms other state-of-the-art methods, particularly when labeled data are scarce, by effectively leveraging both local and global information for classification.
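
To make the overall pipeline concrete, the following is a minimal PyTorch sketch of a hybrid residual 3D-CNN plus transformer classifier of the kind described above. It assumes a squeeze-and-excitation-style channel attention, mean pooling as the sequence aggregation step, and arbitrary layer sizes; these details are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: layer sizes, the SE-style channel attention, and the
# mean-pool sequence aggregation are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation-style channel attention with a residual connection."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, D, H, W)
        w = x.mean(dim=(2, 3, 4))              # global pool over spectral/spatial dims
        w = self.fc(w)[:, :, None, None, None]
        return x + x * w                       # residual channel reweighting


class Residual3DBlock(nn.Module):
    """3D conv block with an identity shortcut and channel attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )
        self.attn = ChannelAttention3D(channels)

    def forward(self, x):
        return torch.relu(self.attn(self.body(x)) + x)


class HybridConvTransformer(nn.Module):
    """Residual 3D-CNN features -> spatial tokens -> transformer -> pooled classifier."""
    def __init__(self, bands: int, num_classes: int,
                 conv_ch: int = 16, embed_dim: int = 64, depth: int = 2, heads: int = 4):
        super().__init__()
        self.stem = nn.Conv3d(1, conv_ch, kernel_size=3, padding=1)
        self.res = Residual3DBlock(conv_ch)
        self.embed = nn.Linear(conv_ch * bands, embed_dim)   # one token per pixel position
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                      # x: (B, 1, bands, patch, patch)
        f = self.res(self.stem(x))             # local spatial-spectral features
        b, c, d, h, w = f.shape
        tokens = f.permute(0, 3, 4, 1, 2).reshape(b, h * w, c * d)
        tokens = self.transformer(self.embed(tokens))
        return self.head(tokens.mean(dim=1))   # sequence aggregation by mean pooling


if __name__ == "__main__":
    model = HybridConvTransformer(bands=30, num_classes=16)
    cube = torch.randn(2, 1, 30, 9, 9)         # two 9x9 patches, 30 reduced bands
    print(model(cube).shape)                   # torch.Size([2, 16])
```

In this sketch the 3D-CNN stage supplies the local spatial-spectral features, while the transformer models global dependencies across spatial positions; pooling the token sequence before the classifier plays the role of the sequence aggregation layer, reducing the parameter count at the head and thus the risk of overfitting with few labeled samples.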