Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition

Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition

27 May 2024 | Xun Yang, Tianyu Chang, Tianzhu Zhang, Shanshan Wang, Richang Hong, Meng Wang
The paper "Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition" addresses the issue of deep neural networks learning domain-dependent shortcuts, which leads to poor performance in unseen target domains due to domain shift. The authors propose a Hierarchical Visual Transformation (HVT) network to learn domain-invariant visual representations, aiming to alleviate this domain shift. The HVT network transforms training samples hierarchically into new domains at three levels: Global, Local, and Pixel, maximizing visual discrepancy and minimizing cross-domain feature inconsistency. To enhance the HVT network, the authors introduce environment-invariant learning, ensuring the visual representation remains consistent across inferred environments. This approach helps the model avoid relying on spurious features and effectively learns domain-invariant representations, improving generalization in various visual tasks such as stereo matching, pedestrian retrieval, and image classification. The extended HVT network, termed EHVT, is evaluated on multiple benchmark datasets, demonstrating significant improvements in generalization performance.The paper "Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition" addresses the issue of deep neural networks learning domain-dependent shortcuts, which leads to poor performance in unseen target domains due to domain shift. The authors propose a Hierarchical Visual Transformation (HVT) network to learn domain-invariant visual representations, aiming to alleviate this domain shift. The HVT network transforms training samples hierarchically into new domains at three levels: Global, Local, and Pixel, maximizing visual discrepancy and minimizing cross-domain feature inconsistency. To enhance the HVT network, the authors introduce environment-invariant learning, ensuring the visual representation remains consistent across inferred environments. This approach helps the model avoid relying on spurious features and effectively learns domain-invariant representations, improving generalization in various visual tasks such as stereo matching, pedestrian retrieval, and image classification. The extended HVT network, termed EHVT, is evaluated on multiple benchmark datasets, demonstrating significant improvements in generalization performance.
Reach us at info@study.space
[slides and audio] Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition