2011 | Daniel Weinland, Rémi Ronfard, Edmond Boyer
This paper provides a comprehensive survey of vision-based methods for action representation, segmentation, and recognition. The authors categorize these methods along three axes: how they represent the spatial and temporal structure of actions, how they segment actions from continuous visual data, and how they achieve view-invariant recognition. The survey covers a wide range of approaches, including body models, image models, local statistics, action grammars, templates, and temporal statistics, detailing each category with representative examples and a discussion of its advantages and limitations. The paper also examines the challenges of action segmentation and view-invariance, and reviews experimental evaluations on publicly available datasets. The authors emphasize that combining spatial and temporal models is essential for handling complex human actions effectively.