LongVLM is a VideoLLM designed for efficient long video understanding. It addresses the challenge of detailed understanding in long-term videos by decomposing them into short-term segments and encoding local features for each segment through a hierarchical token merging module. These local features are concatenated in temporal order to preserve the storyline across segments, and global semantics are integrated into each local feature to enhance context understanding, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets show that LongVLM outperforms previous state-of-the-art methods in both accuracy and response quality. By capturing local and global information jointly, the model produces precise responses to long-term video content while keeping computational costs affordable.
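To make the pipeline concrete, below is a minimal sketch (not the authors' code) of the processing described above: frame tokens are grouped into short-term segments, tokens within each segment are merged hierarchically, the merged local features are concatenated in temporal order, and a global video-level feature is injected into them. The tensor shapes, the pairwise-averaging merge rule, and the additive global injection are illustrative assumptions.

```python
import torch


def merge_tokens(tokens: torch.Tensor, levels: int = 2) -> torch.Tensor:
    """Hierarchically merge tokens by averaging adjacent pairs `levels` times.

    tokens: (num_tokens, dim) visual tokens of one short-term segment.
    """
    for _ in range(levels):
        n, d = tokens.shape
        if n % 2 == 1:                          # pad so tokens pair up evenly
            tokens = torch.cat([tokens, tokens[-1:]], dim=0)
        tokens = tokens.reshape(-1, 2, d).mean(dim=1)
    return tokens


def build_video_tokens(frame_tokens: torch.Tensor,
                       segment_len: int = 8,
                       merge_levels: int = 2) -> torch.Tensor:
    """Turn per-frame tokens into a compact token sequence for the LLM.

    frame_tokens: (num_frames, tokens_per_frame, dim) features from a frozen
    visual encoder (assumed input format).
    Returns: (num_merged_tokens, dim) tokens in temporal order, each enriched
    with a coarse global feature (simple addition, an illustrative choice).
    """
    t, p, d = frame_tokens.shape
    global_feat = frame_tokens.mean(dim=(0, 1))           # global semantics

    segments = frame_tokens.split(segment_len, dim=0)     # short-term segments
    local_parts = []
    for seg in segments:
        seg_tokens = seg.reshape(-1, d)                   # flatten frames in segment
        merged = merge_tokens(seg_tokens, merge_levels)   # local feature aggregation
        local_parts.append(merged)

    local_tokens = torch.cat(local_parts, dim=0)          # keep temporal order
    return local_tokens + global_feat                     # inject global context


# Example: 64 frames, 16 tokens per frame, 256-dim features.
video_tokens = build_video_tokens(torch.randn(64, 16, 256))
print(video_tokens.shape)  # torch.Size([256, 256])
```

The sketch illustrates why the approach stays affordable: each short-term segment is compressed independently before concatenation, so the token sequence passed to the LLM grows slowly with video length while temporal order and global context are preserved.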