This paper proposes a Hierarchical Visual Transformation (HVT) network that learns domain-invariant visual representations for domain-generalizable visual matching and recognition. The HVT network first transforms training samples hierarchically into new domains with diverse distributions at three levels: global, local, and pixel. It then maximizes the visual discrepancy between the source domain and the new domains while minimizing cross-domain feature inconsistency, so that the model captures domain-invariant features. HVT is further enhanced with environment-invariant learning, which enforces representation invariance across automatically inferred environments by minimizing an invariant-learning loss defined over a weighted average of per-environment losses; this discourages the model from relying on spurious features for prediction. The extended network, termed EHVT, thereby learns domain-invariant representations and narrows the domain gap in various visual matching and recognition tasks, such as stereo matching, pedestrian retrieval, and image classification. EHVT is integrated into different models and evaluated on several public benchmark datasets, showing substantial improvements in generalization performance. The code is available at https://github.com/cty8998/EHVT-VisualDG.

The paper addresses the challenge of training stable deep models that generalize well to out-of-distribution (OOD) data in open-world scenarios. Existing domain generalization (DG) methods are categorized into three groups: data manipulation, representation learning, and learning strategy. This work unifies the merits of the first two groups in a single-source DG framework, learning effective visual transformations that diversify the source-domain data and help the model learn domain-invariant representations for domain generalization.
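The core HVT objective can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the encoder is a toy stand-in for the real backbone, and the three transformation functions are hypothetical forms of the global-, local-, and pixel-level transformations described above. It only shows the shape of the two loss terms: a visual discrepancy between source and transformed images (to be maximized) and a feature inconsistency across domains (to be minimized).

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    """Toy feature extractor (stand-in for the real backbone): channel means."""
    return x.mean(axis=(1, 2))

def global_transform(x, rng):
    """Global level (assumed form): perturb image-wide intensity statistics."""
    return x * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1)

def local_transform(x, rng):
    """Local level (assumed form): perturb the statistics of a random patch."""
    y = x.copy()
    _, h, w = y.shape
    ph, pw = h // 2, w // 2
    i = rng.integers(0, h - ph)
    j = rng.integers(0, w - pw)
    y[:, i:i + ph, j:j + pw] *= rng.uniform(0.5, 1.5)
    return y

def pixel_transform(x, rng):
    """Pixel level (assumed form): add independent per-pixel noise."""
    return x + rng.normal(0.0, 0.05, size=x.shape)

def hvt_losses(x, rng):
    """Return (visual discrepancy to maximize, feature inconsistency to minimize)."""
    f_src = encoder(x)
    views = [t(x, rng) for t in (global_transform, local_transform, pixel_transform)]
    discrepancy = np.mean([np.mean((v - x) ** 2) for v in views])
    inconsistency = np.mean([np.mean((encoder(v) - f_src) ** 2) for v in views])
    return discrepancy, inconsistency

x = rng.random((3, 8, 8))  # one CHW image
d, c = hvt_losses(x, rng)
```

In training, the transformation parameters would be optimized adversarially to increase the discrepancy, while the backbone is trained to keep the feature inconsistency small.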
The proposed EHVT framework is applicable to diverse visual matching and recognition tasks, including stereo matching, pedestrian retrieval, semantic segmentation, and image classification.
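The environment-invariant learning term can be sketched as follows. The weighted average of per-environment losses is taken from the description above; the variance-style penalty that pushes the environments' losses toward each other is our own illustrative addition (in the spirit of invariant-risk objectives), not necessarily the paper's exact loss, and all names here are hypothetical.

```python
import numpy as np

def environment_invariant_loss(env_losses, weights=None):
    """Weighted average of per-environment losses plus a weighted variance
    penalty that discourages environment-specific (spurious) solutions.

    env_losses: task loss computed on each automatically inferred environment.
    weights:    per-environment weights; defaults to a uniform average.
    """
    losses = np.asarray(env_losses, dtype=float)
    if weights is None:
        weights = np.full(len(losses), 1.0 / len(losses))
    weights = np.asarray(weights, dtype=float)
    mean_loss = float(weights @ losses)                     # weighted average
    penalty = float(weights @ (losses - mean_loss) ** 2)    # illustrative invariance penalty
    return mean_loss + penalty
```

When every environment incurs the same loss the penalty vanishes and the objective reduces to the plain weighted average, which is the invariance condition the method aims for.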