Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction

27 Mar 2024 | Inhwan Bae, Junoh Lee, Hae-Gon Jeon
This paper proposes LMTraj, a language-based multimodal trajectory predictor that recasts trajectory prediction as a question-answering task. Unlike traditional numerical regression models, which treat trajectory coordinates as continuous signals, LMTraj treats them as discrete signals and converts them into text prompts.

The method first transforms trajectory coordinates and scene images into text using numerical tokenization and image captioning, then integrates the resulting text into a question-answering template that guides the language model to reason about social relationships and scene context. A numerical tokenizer is trained to separate the integer and decimal parts of each coordinate, enabling the model to capture correlations between consecutive numbers. The language model is then trained on these prompts, with beam search used for deterministic predictions and temperature-based sampling for stochastic, multimodal ones.

Evaluated in both zero-shot and supervised settings, LMTraj outperforms existing numerical-based predictors on public pedestrian trajectory benchmarks, accurately capturing social interactions and multimodal futures. By combining prompt engineering, multi-task learning, and language model training, the approach achieves state-of-the-art results and shows that language models, with their capacity for understanding and reasoning, offer a viable new route to pedestrian trajectory prediction.
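As a rough illustration of the pipeline summarized above, the Python sketch below turns an observed trajectory and a scene caption into a question-answering prompt and decodes it with a sequence-to-sequence language model. The template wording, the t5-small checkpoint, and the helper names (coords_to_text, make_prompt) are illustrative assumptions for this sketch, not the authors' released implementation.

    # Hypothetical sketch of the prompt-construction and decoding steps; names and
    # template wording are assumptions, not LMTraj's actual code.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    def coords_to_text(track, precision=2):
        """Render (x, y) coordinates as text, writing the integer and decimal
        parts explicitly so a numerical tokenizer can split them into digit groups."""
        return " ".join(f"({x:.{precision}f}, {y:.{precision}f})" for x, y in track)

    def make_prompt(history, scene_caption, pred_len=12):
        """Fill a question-answering template with the observed trajectory and a
        scene caption (wording is illustrative only)."""
        return (
            f"Scene: {scene_caption} "
            f"The pedestrian's last {len(history)} positions were {coords_to_text(history)}. "
            f"Question: what are the next {pred_len} positions?"
        )

    tokenizer = T5Tokenizer.from_pretrained("t5-small")            # stand-in checkpoint
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    history = [(3.12, 4.05), (3.25, 4.11), (3.39, 4.18)]            # toy observed track
    prompt = make_prompt(history, "a sidewalk next to a parked car")
    inputs = tokenizer(prompt, return_tensors="pt")

    # Deterministic prediction: beam search keeps the single most likely answer.
    det = model.generate(**inputs, num_beams=5, max_new_tokens=128)

    # Stochastic (multimodal) prediction: temperature sampling draws K distinct futures.
    sto = model.generate(**inputs, do_sample=True, temperature=1.0,
                         num_return_sequences=20, max_new_tokens=128)

    print(tokenizer.decode(det[0], skip_special_tokens=True))

In this sketch, beam search yields the single most likely future (the deterministic mode), while temperature sampling returns several candidate futures whose decoded text would then be parsed back into coordinate sequences for multimodal evaluation.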