DTLLM-VLT is a method for generating diverse text descriptions for visual language tracking (VLT), leveraging large language models (LLMs) to enrich the evaluation environment and improve tracking performance. Using a cohesive prompt framework, it produces multi-granularity descriptions that integrate seamlessly into existing visual tracking benchmarks, addressing a key limitation of current VLT benchmarks: they lack a unified semantic framework and typically provide text at only a single granularity. Built on SAM and Osprey, DTLLM-VLT generates large-scale, semantically rich text for single object tracking (SOT) and VLT datasets at low cost, offering four granularity combinations of initial versus dense and concise versus detailed descriptions.

The method is applied to three prominent benchmarks spanning short-term tracking, long-term tracking, and global instance tracking: OTB99_Lang, LaSOT, and MGIT. Comparative experiments with different text granularities evaluate the impact of diverse text on tracking performance and show that the generated descriptions improve short-term and long-term tracking, helping trackers address challenges such as object appearance changes and complex spatio-temporal relationships. The results also indicate that current trackers' text processing and multi-modal alignment abilities need improvement to fully leverage temporal and spatial relationships.
DTLLM-VLT provides a new approach for VLT tasks, enabling fine-grained evaluation of multi-modal trackers and supporting future research on vision datasets.
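The four granularity combinations described above ({initial, dense} x {concise, detailed}) can be sketched as a small configuration-driven loop. This is a minimal illustration only: `describe_region` is a hypothetical stand-in for the SAM + Osprey region-description step, and the dense-annotation stride of 100 frames is an assumption, not a detail taken from the paper.

```python
from dataclasses import dataclass
from typing import List

def describe_region(frame_id: int, detailed: bool) -> str:
    """Hypothetical placeholder for the SAM + Osprey pipeline: segment the
    target in the given frame, then describe the region at the requested
    level of detail. The real models' APIs are not reproduced here."""
    base = f"the tracked object in frame {frame_id}"
    return f"a detailed description of {base}" if detailed else f"a concise description of {base}"

@dataclass
class GranularityConfig:
    dense: bool          # False: describe the initial frame only; True: describe every `interval` frames
    detailed: bool       # False: concise text; True: detailed text
    interval: int = 100  # assumed stride for dense descriptions

def generate_descriptions(num_frames: int, cfg: GranularityConfig) -> List[str]:
    """Generate one of the four granularity combinations for a video."""
    frames = range(0, num_frames, cfg.interval) if cfg.dense else [0]
    return [describe_region(f, cfg.detailed) for f in frames]

# The four combinations the prompt framework offers:
configs = {
    "initial-concise":  GranularityConfig(dense=False, detailed=False),
    "initial-detailed": GranularityConfig(dense=False, detailed=True),
    "dense-concise":    GranularityConfig(dense=True,  detailed=False),
    "dense-detailed":   GranularityConfig(dense=True,  detailed=True),
}
results = {name: generate_descriptions(500, cfg) for name, cfg in configs.items()}
```

Under this sketch, an "initial" configuration yields one description per video while a "dense" one yields a description every 100 frames, which is how a single framework can serve benchmarks with very different annotation densities.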