14 Mar 2024 | Hugo Laurençon, Léo Tronchon, Victor Sanh
This paper introduces WebSight, a synthetic dataset of 2 million pairs of HTML code and corresponding webpage screenshots, aimed at enabling the conversion of web screenshots into functional HTML code with vision-language models (VLMs). The authors identify the lack of a suitable, high-quality dataset as the primary obstacle to this task, and address it by constructing WebSight from diverse, high-quality examples of HTML code paired with rendered screenshots. Fine-tuning a foundational VLM on this dataset yields Sightseer, a model that reliably converts webpage screenshots into HTML code. The paper also discusses the challenges and limitations of the current approach, such as the complexity of real-world HTML files and the need for cleaner, more structured training data. The authors open-source WebSight to accelerate research in this area, providing a valuable resource for UI developers and no-code solutions.
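To make the dataset's structure concrete, here is a minimal sketch of what one screenshot-to-HTML training record might look like. The field names, the `WebPair` class, and the record layout are illustrative assumptions, not WebSight's actual schema:

```python
from dataclasses import dataclass


@dataclass
class WebPair:
    """One synthetic example: HTML source plus its rendered screenshot.

    Hypothetical structure for illustration; WebSight's real schema may differ.
    """
    html: str              # self-contained HTML source of the page
    screenshot_path: str   # path to the screenshot rendered from that HTML


def make_record(pair: WebPair) -> dict:
    """Package a pair as a VLM training record: the screenshot is the
    model input, the HTML source is the target text to generate."""
    return {"image": pair.screenshot_path, "text": pair.html}


example = WebPair(
    html="<html><body><h1>Hello</h1></body></html>",
    screenshot_path="renders/0001.png",
)
record = make_record(example)
print(record["text"])  # the HTML the model learns to emit for this screenshot
```

At training time, a VLM like the one behind Sightseer would consume the image and be supervised to produce the paired HTML string as its output sequence.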