This paper introduces *Endora*, a framework for generating high-quality, dynamic, and realistic endoscopy videos. *Endora* integrates a spatial-temporal video transformer with 2D vision foundation model priors to model the complex spatial-temporal dynamics in clinical videos. The authors present the first public benchmark for endoscopy simulation, adapting existing state-of-the-art methods to this domain. *Endora* demonstrates superior visual quality compared to other state-of-the-art methods, and the authors explore its applications in downstream video analysis tasks and 3D medical scene generation. The paper also includes a comprehensive evaluation of *Endora* on three public endoscopy video datasets, showing its effectiveness in generating realistic endoscopic videos. Key contributions include the development of a high-fidelity medical video generation framework, the creation of a public benchmark, and the integration of 2D vision foundation model priors to enhance feature extraction. *Endora* sets a strong foundation for future research in medical generative AI.