31 May 2016 | Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell
This paper introduces Long-term Recurrent Convolutional Networks (LRCNs), an architecture that combines convolutional layers with long-range temporal recurrence for visual recognition and description. LRCNs are end-to-end trainable, scale to large visual understanding tasks, and are applied here to activity recognition, image captioning, and video description. Unlike earlier models that assume a fixed visual representation or rely on simple temporal averaging, LRCNs are "doubly deep": they learn compositional representations in both space and time, and the nonlinearities in their recurrent state updates let them capture long-term dependencies. Because the recurrent component sits on top of a modern visual convolutional network, temporal dynamics and perceptual representations can be trained jointly with backpropagation, and the model can map variable-length inputs (e.g., video frames) directly to variable-length outputs (e.g., natural language sentences). The results show that this gives LRCNs distinct advantages over state-of-the-art models in which recognition and generation are defined or optimized separately.
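To make the architecture concrete, the sketch below shows the core LRCN pattern: a CNN applied independently to each frame, an LSTM run over the resulting feature sequence, and a classifier producing an output at every time step. It is a minimal illustration written in PyTorch (the paper's original implementation was in Caffe); the small stand-in CNN, layer sizes, and input shape are assumptions for readability, not the paper's configuration.

```python
# Minimal LRCN-style sketch (assumed PyTorch implementation; shapes and
# hyperparameters are illustrative only, not the paper's settings).
import torch
import torch.nn as nn

class LRCNClassifier(nn.Module):
    """CNN applied per frame, followed by an LSTM over the frame features."""
    def __init__(self, num_classes: int, feat_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # Small stand-in CNN; the paper uses a large ImageNet-pretrained network.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        hidden, _ = self.lstm(feats)      # temporal recurrence over per-frame features
        return self.classifier(hidden)    # one class-score vector per time step

# Example: 2 clips of 16 RGB frames at 64x64 -> per-step scores over 10 classes.
scores = LRCNClassifier(num_classes=10)(torch.randn(2, 16, 3, 64, 64))
print(scores.shape)  # torch.Size([2, 16, 10])
```

Because the recurrence is simply unrolled over the feature sequence, the same module accepts clips of any length, which is the property that lets LRCNs handle variable-length inputs.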
The paper evaluates LRCNs on three tasks. For activity recognition, the network predicts an activity class at every time step of a video, and the per-step predictions are averaged into a clip-level label; this outperforms single-frame baselines. For image captioning, the recurrent component is conditioned on CNN image features and generates a natural language description word by word. For video description, the same kind of recurrent decoder produces sentences describing a video. Across all three tasks, LRCNs outperform previous models, producing more accurate and varied captions and descriptions. The paper attributes these gains to the model's ability to capture complex temporal dynamics and to its end-to-end trainability, and concludes that LRCNs are a promising approach to visual recognition and description.
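For the captioning and description tasks, the recurrent component acts as a decoder: a visual feature vector conditions an LSTM language model that emits one word per step until an end-of-sentence token is produced. The sketch below illustrates greedy decoding under that scheme; the vocabulary size, token ids, and the exact way the image feature enters the LSTM are illustrative assumptions rather than the paper's precise factorization.

```python
# Hedged sketch of caption generation with an LSTM decoder conditioned on an
# image feature (assumed PyTorch; token ids and greedy decoding are illustrative).
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size: int, img_dim: int = 256,
                 embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Each step sees the previous word's embedding concatenated with the image feature.
        self.lstm = nn.LSTMCell(embed_dim + img_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, img_feat, bos_id=1, eos_id=2, max_len=20):
        batch = img_feat.size(0)
        h = torch.zeros(batch, self.hidden_dim)
        c = torch.zeros(batch, self.hidden_dim)
        word = torch.full((batch,), bos_id, dtype=torch.long)  # start-of-sentence token
        generated = []
        for _ in range(max_len):
            step_in = torch.cat([self.embed(word), img_feat], dim=1)
            h, c = self.lstm(step_in, (h, c))
            word = self.out(h).argmax(dim=1)                    # most likely next word
            generated.append(word)
            if (word == eos_id).all():                          # stop at end-of-sentence
                break
        return torch.stack(generated, dim=1)

# Example: decode token ids for one image feature (e.g., from the CNN above).
tokens = CaptionDecoder(vocab_size=1000).greedy_decode(torch.randn(1, 256))
print(tokens.shape)  # (1, number_of_generated_tokens)
```

The same decoder pattern extends to video description by conditioning on features aggregated over the video's frames instead of a single image feature.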