This paper introduces a recursive joint cross-modal attention (RJCMA) model for dimensional emotion recognition that effectively captures both intra- and inter-modal relationships across audio, visual, and text modalities. The model computes attention weights based on the cross-correlation between the joint audio-visual-text feature representation and the features of the individual modalities, capturing intra- and inter-modal relationships simultaneously. The attended features of the individual modalities are then fed back into the fusion model through a recursive mechanism to obtain more refined feature representations. Temporal Convolutional Networks (TCNs) are also used to improve the temporal modeling of the individual feature representations. The proposed model is evaluated on the challenging Affwild2 dataset, achieving Concordance Correlation Coefficients (CCC) of 0.585 (0.542) for valence and 0.674 (0.619) for arousal on the validation (test) set. This is a significant improvement over the baseline CCCs of 0.240 (0.211) for valence and 0.200 (0.191) for arousal, and the model placed second in the valence-arousal challenge of the 6th Affective Behavior Analysis in-the-Wild (ABAW) competition. The code is available on GitHub: https://github.com/praveena2j/RJCMA.
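
For concreteness, the fusion mechanism can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the tensor shapes, the per-modality cross-correlation weight layout, and the recursion depth of two iterations are assumptions made for the example.

```python
import torch
import torch.nn as nn


class RecursiveJointCrossAttention(nn.Module):
    """Minimal sketch of recursive joint cross-modal attention (RJCMA).

    Assumptions (not from the paper's code): each modality provides
    features of shape (batch, seq_len, dim); the joint representation
    is the concatenation of all modality features along the feature
    axis; one cross-correlation weight matrix is learned per modality.
    """

    def __init__(self, dim: int, num_modalities: int = 3, num_iters: int = 2):
        super().__init__()
        self.num_iters = num_iters
        joint_dim = dim * num_modalities
        # Hypothetical layout: one cross-correlation weight per modality.
        self.corr_weights = nn.ParameterList(
            nn.Parameter(torch.randn(dim, joint_dim) / joint_dim ** 0.5)
            for _ in range(num_modalities)
        )
        self.attn_proj = nn.ModuleList(
            nn.Linear(joint_dim, dim) for _ in range(num_modalities)
        )

    def forward(self, feats):
        # feats: list of (batch, seq_len, dim) tensors, one per modality.
        attended = list(feats)
        for _ in range(self.num_iters):  # recursive refinement loop
            joint = torch.cat(attended, dim=-1)  # (B, T, joint_dim)
            new_feats = []
            for m, x in enumerate(attended):
                # Cross-correlation between this modality's features and
                # the joint representation: (B, T, T).
                corr = torch.tanh(
                    torch.einsum("btd,de,bse->bts", x, self.corr_weights[m], joint)
                    / x.shape[-1] ** 0.5
                )
                # Attention weights over time steps of the joint representation.
                attn = torch.softmax(corr, dim=-1)
                # Attended features, fed back into fusion on the next iteration.
                new_feats.append(x + self.attn_proj[m](torch.bmm(attn, joint)))
            attended = new_feats
        return torch.cat(attended, dim=-1)  # fused representation
```

Given three modality streams of shape (batch, time, dim), the module returns a fused (batch, time, 3 * dim) representation that a regression head could map to valence and arousal.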
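
For reference, the evaluation metric is the standard Concordance Correlation Coefficient between predictions $x$ and ground-truth annotations $y$:

```latex
\mathrm{CCC} = \frac{2 \rho \, \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}
```

where $\rho$ is the Pearson correlation coefficient, and $\mu_x, \mu_y$ and $\sigma_x^2, \sigma_y^2$ are the means and variances of the predictions and annotations, respectively. Unlike plain correlation, CCC also penalizes shifts in mean and scale, which is why it is preferred for continuous valence-arousal prediction.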