LLMs are Good Action Recognizers

31 Mar 2024 | Haoxuan Qu, Yujun Cai, Jun Liu
The paper "LLM-AR: Large Language Model as an Action Recognizer" by Haoxuan Qu, Yujun Cai, and Jun Liu explores the use of large language models (LLMs) for skeleton-based action recognition. The authors observe that LLMs, with their large model architectures and rich implicit knowledge, can be used effectively as action recognizers. To achieve this, they propose a novel framework called LLM-AR, which applies a linguistic projection process to convert skeleton sequences into "action sentences" that are more compatible with LLMs.

The framework incorporates several designs to enhance this linguistic projection, such as aligning the tokens used in the action sentences with those in the LLM's vocabulary and using a hyperbolic codebook to better represent the tree-like structure of human skeletons. The overall training and testing scheme optimizes an action-based VQ-VAE model and performs low-rank adaptation (LoRA) on the LLM to preserve its pre-trained weights. Extensive experiments on multiple datasets demonstrate the effectiveness of the framework, which achieves state-of-the-art performance. The paper highlights the potential of LLMs for action recognition tasks by leveraging their rich knowledge and large model capacity.
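At its core, a hyperbolic codebook means that each skeleton feature is quantized to its nearest code vector under a hyperbolic metric rather than a Euclidean one. The following is a minimal sketch of such quantization in the Poincaré ball model; the shapes, variable names, and random data are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    # Geodesic distance in the Poincare ball model of hyperbolic space.
    sq_dist = np.sum((u - v) ** 2, axis=-1)
    denom = (1 - np.sum(u ** 2, axis=-1)) * (1 - np.sum(v ** 2, axis=-1))
    return np.arccosh(1 + 2 * sq_dist / (denom + eps))

def quantize(features, codebook):
    # Map each feature vector to the index of its nearest codebook entry
    # under the hyperbolic metric, yielding one discrete token per feature.
    dists = np.stack([poincare_distance(features, c) for c in codebook], axis=-1)
    return np.argmin(dists, axis=-1)

rng = np.random.default_rng(0)
codebook = rng.uniform(-0.3, 0.3, size=(8, 4))   # 8 hypothetical code vectors inside the ball
features = rng.uniform(-0.3, 0.3, size=(5, 4))   # 5 hypothetical per-frame skeleton features
tokens = quantize(features, codebook)
print(tokens)  # discrete token index per frame
```

Aligning these discrete indices with entries of the LLM's vocabulary is what turns a skeleton sequence into an "action sentence" the model can consume.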
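LoRA preserves the pre-trained weights by freezing them and learning only a small low-rank update. A minimal numpy sketch of the idea, where the dimensions and the alpha scaling are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    # Frozen pretrained weight W plus trainable low-rank update B @ A,
    # scaled by alpha / r as in the standard LoRA formulation.
    r = A.shape[0]
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4                 # hypothetical layer sizes, rank r << d
W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01      # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-initialized
x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B)
# With B zero-initialized, the adapted layer initially matches the frozen layer.
print(np.allclose(y, x @ W.T))  # prints True
```

Because only A and B are updated during fine-tuning, the LLM's pre-trained weights stay intact while the model adapts to the action-sentence inputs.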