Multimodal Machine Learning: A Survey and Taxonomy

The paper "Multimodal Machine Learning: A Survey and Taxonomy" by Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency surveys the field of multimodal machine learning. The authors describe our experience of the world as multimodal: it arrives through multiple sensory channels, such as vision, audition, touch, smell, and taste. They argue that for Artificial Intelligence (AI) to understand and interpret the world, it must be able to process and relate information from multiple modalities.
The paper identifies five core technical challenges in multimodal machine learning: representation, translation, alignment, fusion, and co-learning. These challenges go beyond the typical categorization into early and late fusion and are essential for advancing the field. The authors organize them into a new taxonomy intended to help researchers understand the state of the field and identify future research directions. The paper also reviews the historical development of multimodal applications, from early audio-visual speech recognition to more recent language and vision models, and discusses the challenges and opportunities specific to multimodal data, such as heterogeneity and the need to capture correspondences between modalities.
In detail, the paper explores techniques for multimodal representation, which it divides into joint and coordinated representations. Joint representations combine unimodal signals into a single shared representation space, while coordinated representations process each modality separately and enforce similarity or structure constraints between them. The paper also reviews methods for multimodal translation, which maps an entity in one modality to another modality, and discusses the advantages and limitations of example-based and generative approaches.
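The distinction between joint and coordinated representations described above can be illustrated with a minimal sketch. It is not code from the paper: the function names (`joint_representation`, `coordination_loss`) and the toy feature vectors are assumptions made for illustration. Concatenation stands in for the simplest kind of joint representation, and a cosine-based penalty stands in for one possible similarity constraint in a coordinated setup.

```python
import math


def joint_representation(image_vec, text_vec):
    """Joint (illustrative): merge unimodal features into ONE vector.

    Concatenation is used here as the simplest possible combination;
    real systems typically learn the combination instead.
    """
    return list(image_vec) + list(text_vec)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coordination_loss(image_emb, text_emb):
    """Coordinated (illustrative): each modality keeps its OWN embedding.

    The two embeddings are tied together only by a constraint; here the
    hypothetical constraint is 1 - cosine similarity, which is 0 when
    the embeddings point in the same direction.
    """
    return 1.0 - cosine_similarity(image_emb, text_emb)


# Toy, hand-written features (assumed values, not real data).
image_features = [1.0, 0.0]
text_features = [0.0, 1.0, 2.0]

joint = joint_representation(image_features, text_features)
aligned = coordination_loss([1.0, 0.0], [2.0, 0.0])
unrelated = coordination_loss([1.0, 0.0], [0.0, 1.0])
```

In this sketch the joint vector has the combined length of both inputs, whereas the coordinated variant never merges the vectors; it only scores how well the two separate embeddings agree, so the loss is low for the aligned pair and high for the orthogonal pair.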
Overall, the paper provides a comprehensive survey of the recent advances in multimodal machine learning, highlighting the potential and challenges of this interdisciplinary field.

1 Aug 2017 | Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency