This paper introduces a framework for personalized face generation that controls identity and expression simultaneously and supports fine-grained expression synthesis. The framework takes three inputs: a text prompt describing the background, a selfie photo uploaded by the user, and a text label drawn from a fine-grained expression vocabulary; the generated faces match this input triple. At its core is a conditional diffusion model that performs simultaneous face swapping and reenactment (SFSR). To control identity and expression within a single model, the framework introduces three designs: balanced identity and expression encoders, improved midpoint sampling, and explicit background conditioning. Fine-grained expressions are described with a 135-word expression dictionary. The model is trained on CelebA-HQ and FFHQ, evaluated on CelebA-HQ and FF++, and compared against state-of-the-art text-to-image, face swapping, and face reenactment methods to demonstrate its controllability and scalability.
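To make the conditioning setup concrete, below is a minimal sketch (not the paper's code) of how a denoising network might consume the three conditions: a background text embedding, an identity embedding from the selfie, and a fine-grained expression label. The module names, embedding dimensions, and the simple additive/broadcast conditioning are illustrative assumptions; a real system would typically use pretrained identity, expression, and text encoders and inject conditions via cross-attention.

```python
import torch
import torch.nn as nn

class ToyConditionalDenoiser(nn.Module):
    """Toy denoiser conditioned on identity, expression label, and background text."""
    def __init__(self, img_channels=3, cond_dim=256):
        super().__init__()
        # Stand-ins for the encoders the paper balances against each other:
        # an identity encoder, an expression encoder over the 135-word
        # dictionary, and a text encoder for the background prompt.
        self.id_proj = nn.Linear(512, cond_dim)        # identity embedding -> condition
        self.expr_embed = nn.Embedding(135, cond_dim)  # one slot per expression word
        self.bg_proj = nn.Linear(768, cond_dim)        # background text embedding -> condition
        self.time_embed = nn.Sequential(nn.Linear(1, cond_dim), nn.SiLU(),
                                        nn.Linear(cond_dim, cond_dim))
        # A tiny convolutional backbone just to keep the sketch self-contained.
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + cond_dim, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, img_channels, 3, padding=1),
        )

    def forward(self, x_t, t, id_emb, expr_id, bg_emb):
        # Fuse the three conditions plus the timestep into one vector and
        # broadcast it over the spatial grid (real systems would more likely
        # use cross-attention layers for this).
        cond = (self.id_proj(id_emb) + self.expr_embed(expr_id)
                + self.bg_proj(bg_emb) + self.time_embed(t[:, None]))
        cond_map = cond[:, :, None, None].expand(-1, -1, x_t.shape[2], x_t.shape[3])
        return self.net(torch.cat([x_t, cond_map], dim=1))  # predicted noise

# Usage with random stand-in inputs.
model = ToyConditionalDenoiser()
x_t = torch.randn(2, 3, 64, 64)        # noisy image at timestep t
t = torch.rand(2)                      # diffusion timestep in [0, 1]
id_emb = torch.randn(2, 512)           # selfie identity embedding
expr_id = torch.randint(0, 135, (2,))  # index into the expression dictionary
bg_emb = torch.randn(2, 768)           # background prompt embedding
eps_pred = model(x_t, t, id_emb, expr_id, bg_emb)
print(eps_pred.shape)                  # torch.Size([2, 3, 64, 64])
```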
The results show that the framework achieves fine-grained expression control while maintaining identity consistency, and that it outperforms existing methods on all metrics, including identity and expression consistency, realism, and image quality. Compared with dedicated face swapping and reenactment methods, it generates more accurate expressions and poses, and the improved midpoint sampling reduces information loss and improves image reconstruction. User studies further confirm that the generated images are realistic and exhibit consistent expressions.

The contributions are a novel personalized face generation framework, a new face manipulation task (simultaneous face swapping and reenactment), and three designs in the conditional diffusion model, which together yield high-fidelity portraits that preserve both identity and expression. Limitations include text labels whose semantics are not always fully reflected in the output, and ambiguous expression labels whose semantics can overlap. Results are available on the project homepage.
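For readers unfamiliar with the term, the following sketch shows plain midpoint (second-order Runge-Kutta) integration of a diffusion sampling ODE; it is only meant to make "midpoint sampling" concrete and does not reproduce the paper's improved variant or its exact update rule. The velocity function and the integration schedule are toy assumptions.

```python
import torch

def midpoint_sample(x, velocity, t_start=1.0, t_end=0.0, n_steps=25):
    """Integrate dx/dt = velocity(x, t) from t_start to t_end with midpoint steps."""
    ts = torch.linspace(t_start, t_end, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        h = t_next - t
        k1 = velocity(x, t)                # slope at the start of the step
        x_mid = x + 0.5 * h * k1           # half-step (Euler) prediction
        k2 = velocity(x_mid, t + 0.5 * h)  # slope re-evaluated at the midpoint
        x = x + h * k2                     # full step using the midpoint slope
    return x

# Toy usage: a stand-in velocity field chosen so that integrating backward in
# time (t: 1 -> 0) contracts samples toward a fixed target.
target = torch.zeros(2, 3, 64, 64)
toy_velocity = lambda x, t: x - target
x_T = torch.randn(2, 3, 64, 64)
x_0 = midpoint_sample(x_T, toy_velocity)
print(x_0.abs().mean())  # noticeably smaller than x_T.abs().mean(): the toy ODE contracts toward the target
```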