12 Jul 2024 | Zeyu Liu†, Weicong Liang†, Yiming Zhao†, Bohan Chen†, Lin Liang, Lijuan Wang, Ji Li, Yuhui Yuan‡
The paper "Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering" addresses two limitations of existing visual text rendering models: their focus on English and their poor aesthetic quality. The authors introduce Glyph-ByT5-v2 and Glyph-SDXL-v2, which support accurate visual text rendering in 10 languages and improve aesthetic quality. Key contributions include:
1. **Dataset Creation**: Development of a high-quality multilingual glyph-text and graphic design dataset, consisting of 1 million glyph-text pairs and 10 million graphic design image-text pairs.
2. **Benchmark Development**: Creation of a multilingual visual paragraph benchmark with 1,000 prompts to assess visual spelling accuracy.
3. **Aesthetic Enhancement**: Implementation of step-aware preference learning and albedo techniques to improve visual aesthetics.
The approach combines these techniques to create a powerful multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2. The paper demonstrates the effectiveness of these models through a user study and comparisons with DALL-E3, showing superior performance in visual text rendering and aesthetic quality.
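
The summary does not say how the benchmark scores visual spelling accuracy. As a rough illustration only, the sketch below shows one plausible way to score a single prompt: compare the text requested in the prompt with the text an OCR system reads back from the generated image, and report the fraction of target tokens that were recovered. The function name `spelling_accuracy`, the `level` parameter, and the choice of multiset matching are assumptions made for this example, not the paper's evaluation code; the OCR step is assumed to have run elsewhere.

```python
from collections import Counter


def spelling_accuracy(target: str, recognized: str, level: str = "word") -> float:
    """Fraction of target tokens that also appear in the recognized text.

    Hypothetical helper: `recognized` is assumed to be OCR output for the
    generated image. Tokens are matched as multisets, so order is ignored
    and a repeated token only counts as many times as it was recognized.
    """
    if level == "char":
        # Character-level tokens suit scripts without word spacing.
        target_tokens = [c for c in target if not c.isspace()]
        recognized_tokens = [c for c in recognized if not c.isspace()]
    else:
        # Word-level tokens: split on whitespace.
        target_tokens = target.split()
        recognized_tokens = recognized.split()

    if not target_tokens:
        return 1.0  # nothing was requested, so nothing can be misspelled

    # Counter intersection keeps the minimum count of each shared token.
    overlap = Counter(target_tokens) & Counter(recognized_tokens)
    return sum(overlap.values()) / len(target_tokens)


print(spelling_accuracy("hello world", "hello wrld"))        # 0.5
print(spelling_accuracy("你好世界", "你好世", level="char"))  # 0.75
```

Averaging this score over all prompts in a benchmark would give a single accuracy figure per language; whether the paper aggregates in this way is not stated in the summary.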