IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based Human Activity Recognition


February 2023 | ZIKANG LENG, AMITRAJIT BHATTACHARJEE, HRUDHAI RAJASEKHAR, LIZHE ZHANG, ELIZABETH BRUDA, HYEOKHYEN KWON, THOMAS PLÖTZ
IMUGPT 2.0 is a language-based cross modality transfer system for sensor-based human activity recognition (HAR). The system uses large language models (LLMs) to generate textual descriptions of activities, which are then converted into motion sequences by a motion synthesis model. A novel motion filter screens out incorrect sequences, retaining only relevant motion sequences for virtual IMU data extraction. A new diversity metric measures shifts in the distribution of generated textual descriptions and motion sequences, allowing for the definition of a stopping criterion that controls when data generation can be stopped for the most effective and efficient processing and the best downstream activity recognition performance.

One of the primary challenges in HAR is the lack of large labeled datasets. To address this, cross modality transfer approaches have been explored that convert existing datasets from a source modality, such as video, to a target modality (IMU). With the emergence of generative AI models such as LLMs and text-driven motion synthesis models, language has become a promising source data modality.

In this work, we conduct a large-scale evaluation of language-based cross modality transfer to determine its effectiveness for HAR. Based on this study, we introduce two new extensions to IMUGPT that enhance its use in practical HAR application scenarios: a motion filter capable of filtering out irrelevant motion sequences to ensure the relevance of the generated virtual IMU data, and a set of metrics that measure the diversity of the generated data, facilitating the determination of when to stop generating virtual IMU data for both effective and efficient processing. We demonstrate that our diversity metrics can reduce the effort needed for the generation of virtual IMU data by at least 50%, which opens up IMUGPT for practical use cases beyond a mere proof of concept.
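To make the role of the diversity metrics concrete, the following is a minimal sketch of how such a stopping criterion could operate, under assumptions of our own: generated textual descriptions are embedded with some sentence-embedding model, diversity is approximated by the mean pairwise cosine distance of the embeddings, and generation stops once this score barely changes between rounds. The function names, window size, and threshold are illustrative and not the paper's exact formulation.

```python
import numpy as np

def mean_pairwise_distance(embeddings):
    """Diversity proxy: mean pairwise cosine distance over the generated samples."""
    x = np.asarray(embeddings, dtype=float)
    if len(x) < 2:
        return 0.0
    x = x / np.linalg.norm(x, axis=1, keepdims=True)    # unit-normalize rows
    sims = x @ x.T                                      # cosine similarities (diagonal = 1)
    n = len(x)
    return (n * n - sims.sum()) / (n * (n - 1))         # mean (1 - cosine) over distinct pairs

def should_stop(history, window=5, tol=0.01):
    """Stop once the diversity score has changed by less than `tol` (relative)
    over the last `window` generation rounds."""
    if len(history) < window + 1:
        return False
    old, new = history[-window - 1], history[-1]
    return abs(new - old) / max(abs(old), 1e-9) < tol

# Usage sketch (names are assumptions): `embed` is any sentence-embedding model,
# `generate_descriptions` is one round of LLM prompting for one activity class.
#
# embeddings, history = [], []
# while not (history and should_stop(history)):
#     embeddings += [embed(t) for t in generate_descriptions(activity, n=50)]
#     history.append(mean_pairwise_distance(embeddings))
```

A diversity score that plateaus indicates that further prompting mostly reproduces variation already covered, which is the condition under which generating additional virtual IMU data stops paying off.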
The key idea is to use LLMs to generate diverse textual descriptions of the different ways that humans can perform certain activities. The generated textual descriptions are then converted into 3D human movement sequences using motion synthesis methods. The resulting sequences of action-specific poses are then converted into virtual IMU training data (simplified sketches of this conversion and of the motion filtering step are given below). In principle, such a system allows for the generation of labeled datasets that are larger and encompass more activities than any existing ones. Such generated datasets would help pave the way for the development of more complex HAR models that are robust, generalizable, and allow for the analysis of more complex human movements and gestures, yet without involving any human participants.

While the initial IMUGPT system served as a preliminary proof of concept, it demonstrated promise in small-scale experiments, where the generated virtual IMU data led to significant improvements in the performance of the downstream classifier. This paper builds on those initial results by significantly expanding the proof of concept through a range of technical modifications and additions that render the approach valuable for practical applications, and by thoroughly evaluating it in a large-scale experimental study. Our goal is to determine not only how, but also how much and what kind of, virtual IMU data should be derived from the language-based input for effective and efficient processing.
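The conversion from synthesized poses to virtual IMU data mentioned above can be illustrated with a simplified finite-difference approximation. This is only a sketch under our own assumptions (a single body location tracked as global 3D positions at a fixed frame rate, gravity added in the global frame), not the system's actual IMU extraction step, which would also model sensor orientation, noise, and calibration.

```python
import numpy as np

def virtual_accelerometer(joint_positions, fps=20.0, gravity=(0.0, -9.81, 0.0)):
    """Approximate accelerometer readings for one body location of a synthesized clip.

    joint_positions: (T, 3) global 3D positions in meters (e.g., the wrist joint).
    fps:             frame rate of the motion synthesis output.
    Returns:         (T-2, 3) acceleration samples in m/s^2, expressed in the
                     global frame with a constant gravity term added.
    """
    dt = 1.0 / fps
    velocity = np.diff(joint_positions, axis=0) / dt    # (T-1, 3) finite-difference velocity
    linear_acc = np.diff(velocity, axis=0) / dt         # (T-2, 3) finite-difference acceleration
    return linear_acc + np.asarray(gravity)             # gravity convention is an assumption

# Example with a placeholder 100-frame trajectory standing in for a synthesized clip:
wrist = np.cumsum(np.random.randn(100, 3) * 0.01, axis=0)
acc = virtual_accelerometer(wrist, fps=20.0)
print(acc.shape)  # (98, 3)
```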
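Finally, the motion filter described earlier can be pictured as a relevance check applied to each generated motion sequence before virtual IMU extraction. The sketch below assumes a hypothetical pretrained motion classifier with a `predict(clip) -> (label, confidence)` interface; the interface, threshold, and filtering rule are illustrative assumptions rather than the paper's actual filter design.

```python
from typing import Iterable, List, Tuple

import numpy as np

def filter_motions(
    clips: Iterable[Tuple[np.ndarray, str]],   # (motion clip, prompted activity label)
    classifier,                                # hypothetical pretrained motion classifier
    min_confidence: float = 0.7,
) -> List[Tuple[np.ndarray, str]]:
    """Keep only clips whose predicted activity matches the prompted label with
    sufficient confidence; all other clips are discarded before virtual IMU
    data is extracted from them."""
    kept = []
    for clip, label in clips:
        predicted, confidence = classifier.predict(clip)   # assumed interface
        if predicted == label and confidence >= min_confidence:
            kept.append((clip, label))
    return kept
```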