Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

14 Mar 2024 | Hugo Laurençon, Léo Tronchon, Victor Sanh
This paper introduces WebSight, a synthetic dataset of 2 million pairs of HTML code and corresponding webpage screenshots, and Sightseer, a vision-language model (VLM) fine-tuned on WebSight to convert screenshots into functional HTML code. The authors argue that progress in this area has been hindered by the lack of a suitable, high-quality dataset.

WebSight is built by prompting large language models (LLMs) to generate HTML code and then rendering that code to produce the paired screenshots. Compared with the initial version of the dataset, this release brings significant improvements: higher-resolution screenshots, richer metadata, and the use of Tailwind CSS for more visually appealing designs.

Sightseer is obtained by fine-tuning a foundational VLM on WebSight and demonstrates proficiency in converting screenshots into HTML code. It also adapts to scenarios it was not trained on, such as turning handwritten sketches into functional HTML. Evaluating the model on a variety of screenshots, the authors find that it accurately preserves text when the amount of text is limited and can generalize to websites that differ significantly in appearance. However, it struggles with complex layouts, excessive text, or designs that are significantly different from its training data. It also sometimes produces errors not seen in the initial version, which used traditional CSS instead of Tailwind CSS; the authors hypothesize that this is due to Tailwind CSS appearing less frequently in the pre-training data of the base LLM.

The authors conclude that WebSight and Sightseer are a significant contribution toward automating the conversion of webpage screenshots into HTML code, and by open-sourcing WebSight they aim to foster further innovation and research in this area. The paper also discusses related work, including previous attempts to generate code from screenshots using various methods and models.
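To make the data-generation step more concrete, below is a minimal Python sketch of the rendering half of the pipeline: it takes an HTML string of the kind an LLM might produce (styled with Tailwind CSS loaded from a CDN) and captures a screenshot in a headless browser. The specific tooling and parameters here (Playwright, the viewport size, the CDN link, and the sample HTML) are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a WebSight-style rendering step: HTML string -> screenshot.
# Assumptions: Playwright for headless rendering, Tailwind CSS via CDN,
# a fixed viewport. (pip install playwright && playwright install chromium)
from playwright.sync_api import sync_playwright

# Stand-in for HTML that an LLM would generate; Tailwind classes style the page.
llm_generated_html = """
<!DOCTYPE html>
<html>
  <head>
    <script src="https://cdn.tailwindcss.com"></script>
  </head>
  <body class="bg-gray-100 p-8">
    <h1 class="text-3xl font-bold text-blue-700">Acme Travel</h1>
    <p class="mt-2 text-gray-600">Find your next destination.</p>
    <button class="mt-4 rounded bg-blue-600 px-4 py-2 text-white">Book now</button>
  </body>
</html>
"""

def render_screenshot(html: str, out_path: str = "screenshot.png") -> None:
    """Render an HTML string in a headless browser and save a screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 960})
        # Wait for network idle so the Tailwind CDN stylesheet is applied.
        page.set_content(html, wait_until="networkidle")
        page.screenshot(path=out_path, full_page=True)
        browser.close()

if __name__ == "__main__":
    render_screenshot(llm_generated_html)
```

Pairing each rendered screenshot with the HTML that produced it yields exactly the kind of (image, code) example the dataset is made of.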
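For the inference direction (screenshot in, HTML out), a Sightseer-style model can be driven with a generic vision-to-sequence interface. The sketch below assumes the Hugging Face transformers auto classes and uses a placeholder checkpoint id; the released model may be packaged and prompted differently.

```python
# Minimal inference sketch for a screenshot-to-HTML VLM.
# The checkpoint id is a placeholder, and the generic transformers
# vision-to-sequence API is assumed for illustration only.
# (pip install transformers pillow torch)
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

checkpoint = "your-org/screenshot-to-html-vlm"  # hypothetical model id
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

image = Image.open("screenshot.png")  # e.g. the screenshot rendered above
prompt = "Convert this webpage screenshot into HTML with Tailwind CSS:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
html_code = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(html_code)
```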