LLMs are Good Action Recognizers


31 Mar 2024 | Haoxuan Qu, Yujun Cai, Jun Liu
This paper proposes a novel framework called LLM-AR (Large Language Model as an Action Recognizer) for skeleton-based action recognition. The framework leverages large language models (LLMs), which have been used extensively in natural language processing, to recognize human actions: LLMs can handle large amounts of data and possess rich implicit knowledge, which makes them well suited to action recognition.

The key idea of LLM-AR is to treat the large language model as an action recognizer by first projecting the input action signal (a skeleton sequence) into a "sentence format" called an "action sentence." This projection is achieved through a linguistic projection process, which learns a vector quantized variational autoencoder (VQ-VAE) to convert the skeleton sequence into a sequence of discrete tokens. The resulting "action sentences" are then fed into the large language model to predict the corresponding action.

To make the "action sentences" more effective, the framework incorporates several design strategies: aligning the tokens used in the "action sentences" with those used by the large language model, incorporating human inductive biases (such as Zipf's law and context-sensitivity), and using a hyperbolic codebook to better represent the tree-like structure of the human skeleton.

The framework is evaluated on four datasets: NTU RGB+D, NTU RGB+D 120, Toyota Smarthome, and UAV-Human, and achieves state-of-the-art performance on all of them.
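To make the linguistic projection concrete, here is a minimal sketch of a VQ-VAE tokenizer for skeleton sequences, written in PyTorch. The layer sizes, codebook size, and convolutional encoder are illustrative assumptions rather than the paper's exact architecture, and the training losses (reconstruction, codebook, and commitment terms with a straight-through estimator) are omitted for brevity.

```python
import torch
import torch.nn as nn

class SkeletonVQVAE(nn.Module):
    def __init__(self, joint_dim=75, hidden_dim=256, codebook_size=512):
        super().__init__()
        # Temporal encoder: downsamples the skeleton sequence into latent vectors.
        self.encoder = nn.Sequential(
            nn.Conv1d(joint_dim, hidden_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=4, stride=2, padding=1),
        )
        # Discrete codebook: each latent vector is snapped to its nearest code.
        self.codebook = nn.Embedding(codebook_size, hidden_dim)
        # Decoder: reconstructs the skeleton sequence from quantized latents.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hidden_dim, hidden_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden_dim, joint_dim, kernel_size=4, stride=2, padding=1),
        )

    def tokenize(self, skeletons):
        # skeletons: (batch, frames, joint_dim), e.g. 25 joints x 3 coords = 75.
        z = self.encoder(skeletons.transpose(1, 2)).transpose(1, 2)  # (B, T', D)
        flat = z.reshape(-1, z.size(-1))
        # Euclidean nearest-neighbour lookup over the codebook entries.
        dists = torch.cdist(flat, self.codebook.weight)              # (B*T', K)
        return dists.argmin(dim=-1).view(z.shape[:-1])               # token ids

    def forward(self, skeletons):
        tokens = self.tokenize(skeletons)
        z_q = self.codebook(tokens)                                  # (B, T', D)
        recon = self.decoder(z_q.transpose(1, 2)).transpose(1, 2)
        return recon, tokens
```

Each skeleton sequence thus becomes a short sequence of integer token ids, which serve as the raw material for an "action sentence."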
Ablation studies attribute the framework's effectiveness to the incorporation of human inductive biases, the use of discrete tokens, and the hyperbolic codebook (sketched below). The framework also preserves the pre-trained knowledge of the large language model, which is crucial for accurate action recognition.
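The hyperbolic codebook is motivated by the fact that hyperbolic space embeds tree-like structures, such as the kinematic tree of the human skeleton, with low distortion. Below is a minimal sketch of nearest-code lookup under the Poincaré-ball distance, assuming the encoder outputs have already been mapped into the unit ball (for example via an exponential map); the paper's exact parameterization may differ.

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    # Pairwise Poincare-ball distances between u: (N, D) and v: (K, D),
    # both assumed to lie strictly inside the unit ball.
    sq_u = u.pow(2).sum(-1, keepdim=True)        # (N, 1)
    sq_v = v.pow(2).sum(-1).unsqueeze(0)         # (1, K)
    sq_diff = torch.cdist(u, v).pow(2)           # (N, K)
    denom = (1 - sq_u).clamp_min(eps) * (1 - sq_v).clamp_min(eps)
    return torch.acosh(1 + 2 * sq_diff / denom)

def hyperbolic_tokenize(latents, codebook):
    # latents: (B, T, D) encoder outputs inside the unit ball;
    # codebook: (K, D) hyperbolic code vectors.
    flat = latents.reshape(-1, latents.size(-1))
    tokens = poincare_distance(flat, codebook).argmin(dim=-1)
    return tokens.view(latents.shape[:-1])
```

Swapping the Euclidean metric for the Poincaré distance is the only change relative to the standard VQ-VAE lookup above.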
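Putting the pieces together, the discrete tokens are assembled into an "action sentence" and handed to the LLM. The sketch below uses the Hugging Face transformers API with plain placeholder strings such as "<act_17>" and a generic instruction prompt; these, and the LLaMA checkpoint, are illustrative assumptions. In the paper the action tokens are aligned with the LLM's own vocabulary, and the model is adapted in a way that retains its pre-trained knowledge (a parameter-efficient scheme such as LoRA would be one option).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed backbone for illustration; the paper's exact checkpoint may differ.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def recognize(action_tokens):
    # action_tokens: list[int] produced by the VQ-VAE tokenizer above.
    sentence = " ".join(f"<act_{t}>" for t in action_tokens)
    prompt = (
        "Below is a token sequence describing a human action.\n"
        f"{sentence}\n"
        "The action is:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=8)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```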