Co-speech gestures, when presented in video form, enhance the visual impact of human-machine interaction. Previous works often generate structural human skeletons, which lack appearance information. This paper focuses on directly generating audio-driven co-speech gesture videos, addressing two main challenges: 1) capturing complex human movements together with essential appearance information, and 2) aligning gestures and speech temporally, even for sequences of arbitrary length. To address these challenges, a novel motion-decoupled framework is proposed. Specifically, a nonlinear thin-plate spline (TPS) transformation is introduced to obtain latent motion features that preserve appearance information. A transformer-based diffusion model learns the temporal correlation between gestures and speech and generates samples in the latent motion space. An optimal motion selection module then produces long-term coherent and consistent gesture videos. Finally, a refinement network enhances visual perception by focusing on missing details. Extensive experiments show that the proposed framework outperforms existing methods in both motion-related and video-related evaluations. The code, demos, and resources are available at <https://github.com/thuhcsi/S2G-MDDiffusion>.
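
As a rough illustration of the TPS component mentioned above, the sketch below fits a thin-plate spline from a few 2-D control keypoints and uses it to warp a dense coordinate grid. It is only a minimal example under assumed toy keypoints and function names; it is not the authors' released implementation, which couples such transformations with learned latent motion features rather than a bare coordinate grid.

```python
# Minimal sketch of a thin-plate spline (TPS) warp: fit a nonlinear 2-D mapping
# from source control keypoints to displaced ones, then evaluate it on a grid.
# All keypoints, shapes, and names here are illustrative assumptions.
import numpy as np

def tps_radial_basis(r2):
    """TPS kernel U(r) = r^2 * log(r^2), with U(0) defined as 0."""
    return np.where(r2 == 0, 0.0, r2 * np.log(np.maximum(r2, 1e-12)))

def fit_tps(src, dst):
    """Solve for TPS coefficients mapping control points src -> dst.

    src, dst: (N, 2) arrays. Returns (w, a) with w: (N, 2) nonlinear weights
    and a: (3, 2) affine part, one column per output coordinate.
    """
    n = src.shape[0]
    # Pairwise squared distances between control points.
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    K = tps_radial_basis(d2)                      # (N, N)
    P = np.hstack([np.ones((n, 1)), src])         # (N, 3) affine terms [1, x, y]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst
    params = np.linalg.solve(L, rhs)              # (N + 3, 2)
    return params[:n], params[n:]

def warp_points(query, src, w, a):
    """Apply the fitted TPS to arbitrary query points of shape (M, 2)."""
    d2 = np.sum((query[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    U = tps_radial_basis(d2)                      # (M, N)
    affine = a[0] + query @ a[1:]                 # (M, 2)
    return affine + U @ w

# Toy usage: warp a dense 64x64 grid with 5 hypothetical keypoint pairs.
src_kp = np.array([[0.2, 0.2], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8], [0.8, 0.8]])
dst_kp = src_kp + 0.05 * np.random.randn(*src_kp.shape)   # slight motion
w, a = fit_tps(src_kp, dst_kp)
ys, xs = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64), indexing="ij")
grid = np.stack([xs.ravel(), ys.ravel()], axis=-1)         # (4096, 2)
warped_grid = warp_points(grid, src_kp, w, a)               # dense warp field
```

The resulting dense warp field moves existing image content rather than synthesizing it from scratch, which is why a TPS-style motion representation can preserve appearance information.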