This paper introduces Vary-toy, a small-size large vision-language model (LVLM) that achieves performance comparable to much larger models such as Qwen-VL-7B and LLaVA-7B on a range of benchmarks. Vary-toy is built on the Qwen-1.8B language model and incorporates an improved vision vocabulary that strengthens its ability to understand images and generate text from them. The vision vocabulary is produced by training on a mixture of PDF image-text pairs and object detection data, so that it encodes both dense document text and visual information about natural objects. The model is trained on a variety of tasks, including document understanding, image captioning, and visual question answering, and achieves 65.6% ANLS on DocVQA, 59.1% accuracy on ChartQA, 88.1% accuracy on RefCOCO, and 29% on MMVet. The model is also efficient enough to run on a single GTX 1080 Ti GPU, making it accessible to researchers with limited resources. The paper further details the architecture of Vary-toy, in particular how the vision vocabulary network is integrated with the Qwen-1.8B language model. The results show that Vary-toy performs well across tasks and can serve as a useful baseline for future research on LVLMs.
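To make the described architecture more concrete, below is a minimal PyTorch sketch of the kind of dual-branch design the summary refers to: a new vision vocabulary network producing visual tokens alongside a CLIP-style branch, with both token streams projected into the language model's embedding space and decoded together with the text. This is only an illustrative sketch under those assumptions; the module names, dimensions, stand-in encoders, and the tiny decoder here are not the actual Vary-toy implementation, which integrates its trained vocabulary network with the Qwen-1.8B decoder.

```python
# Illustrative sketch of a Vary-style dual-vocabulary LVLM forward pass.
# All module names, dimensions, and the fusion strategy are assumptions
# for exposition, not the paper's exact implementation.
import torch
import torch.nn as nn


class ToyVisionVocabularyLVLM(nn.Module):
    def __init__(self, llm_dim=2048, clip_dim=1024, vocab_dim=1024, vocab_size=32000):
        super().__init__()
        # "New" vision vocabulary branch (trained on PDF + detection data in the
        # paper); stand-in here: a conv patch encoder producing visual tokens.
        self.vision_vocab = nn.Sequential(
            nn.Conv2d(3, vocab_dim, kernel_size=32, stride=32),
            nn.Flatten(2),
        )
        # CLIP-style branch for natural-image semantics (stand-in encoder).
        self.clip_branch = nn.Sequential(
            nn.Conv2d(3, clip_dim, kernel_size=32, stride=32),
            nn.Flatten(2),
        )
        # Linear projectors mapping each branch into the LLM embedding space.
        self.vocab_proj = nn.Linear(vocab_dim, llm_dim)
        self.clip_proj = nn.Linear(clip_dim, llm_dim)
        # Stand-in for the Qwen-1.8B decoder: a tiny transformer plus LM head.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, text_ids):
        # Each branch yields a sequence of visual tokens of shape (B, N, C).
        vocab_tokens = self.vision_vocab(image).transpose(1, 2)
        clip_tokens = self.clip_branch(image).transpose(1, 2)
        visual = torch.cat(
            [self.vocab_proj(vocab_tokens), self.clip_proj(clip_tokens)], dim=1
        )
        # Prepend visual tokens to the text embeddings and decode jointly.
        hidden = self.decoder(torch.cat([visual, self.text_embed(text_ids)], dim=1))
        return self.lm_head(hidden)


if __name__ == "__main__":
    model = ToyVisionVocabularyLVLM()
    logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
    print(logits.shape)  # (1, num_visual_tokens + 16, vocab_size)
```

The design choice illustrated is the key idea the summary attributes to Vary-toy: rather than enlarging the language model, the visual side is enriched with an additional vocabulary so that dense text in documents and object-level information can both be tokenized for a compact 1.8B-parameter LLM.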