2006 | Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
The chapter "Analysis of Representations for Domain Adaptation" by Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira, presented by Marina Sokolova, discusses the challenges and solutions in domain adaptation. The authors motivate the need for a common representation to bridge the gap between source and target domains, where the assumption of the same distribution for training and testing data is often violated. They formalize the problem using a bound on the target generalization error of a classifier trained from labeled data in the source domain.
The problem setup defines an instance set $\mathcal{X}$, a binary label set $\{0, 1\}$, and a feature set $\mathcal{Z}$ into which a representation function $\mathcal{R}$ maps instances. The generalization bound theorem upper-bounds the target error $\epsilon_T(h)$ in terms of the empirical source error $\hat{\epsilon}_S(h)$, the sizes of the labeled and unlabeled samples, and a distance between the induced source and target distributions. A good representation $\mathcal{R}$ must therefore do two things at once: minimize the training error while keeping the domain distance small.
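In slightly more detail, the main bound has roughly the following shape (reconstructed here from the published theorem; the exact constants and the precise definition of the divergence term follow the paper): if $\mathcal{H}$ is a hypothesis space of VC dimension $d$ and $\hat{\epsilon}_S(h)$ is the empirical error of $h$ on $m$ labeled source examples, then with probability at least $1 - \delta$, for every $h \in \mathcal{H}$,

$$\epsilon_T(h) \leq \hat{\epsilon}_S(h) + \sqrt{\frac{4}{m}\left(d \log \frac{2em}{d} + \log \frac{4}{\delta}\right)} + d_{\mathcal{H}}(\tilde{\mathcal{D}}_S, \tilde{\mathcal{D}}_T) + \lambda,$$

where $d_{\mathcal{H}}(\tilde{\mathcal{D}}_S, \tilde{\mathcal{D}}_T)$ is a distance between the source and target distributions induced on $\mathcal{Z}$ by $\mathcal{R}$, and $\lambda$ is the combined source-and-target error of the best single hypothesis in $\mathcal{H}$. A companion result replaces the distance term with an empirical estimate computed from finite unlabeled samples, which is where the unlabeled sample size enters the bound.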
Because the true distributions are unknown, the distance between them is defined via the empirical measure on finite samples, and the domain distance is measured in practice by training a classifier to discriminate between points drawn from the source and target distributions: the better such a classifier can separate the two samples, the larger the distance. The chapter includes an empirical example of adapting a part-of-speech tagger from the financial to the biomedical domain, demonstrating the effectiveness of the proposed approach.
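As a rough illustration of this measurement, the sketch below estimates a classifier-based domain distance in the spirit of the paper's A-distance, using the $2(1 - 2 \cdot \mathrm{err})$ proxy popularized in follow-up work. The choice of logistic regression, the 50/50 train/test split, and the synthetic Gaussian data are illustrative assumptions, not the authors' exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(source_X, target_X, seed=0):
    """Estimate domain distance by training a classifier to
    discriminate source points (label 0) from target points (label 1).

    Returns 2 * (1 - 2 * err), where err is the held-out error of the
    domain classifier: near 0 means the domains are indistinguishable,
    near 2 means they are fully separable."""
    X = np.vstack([source_X, target_X])
    y = np.concatenate([np.zeros(len(source_X)), np.ones(len(target_X))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)  # held-out misclassification rate
    return 2.0 * (1.0 - 2.0 * err)

# Example: two Gaussian "domains" whose means differ, so a domain
# classifier can partially separate them.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(500, 10))
tgt = rng.normal(0.5, 1.0, size=(500, 10))
print(proxy_a_distance(src, tgt))
```

The same function applied to two samples from the same distribution should return a value near zero, which is the regime in which the distance term in the bound above stops penalizing adaptation.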
The contributions of the paper include an analysis of classification problems in which the training and test distributions differ, and an upper bound on the generalization error of classifiers trained on a source domain and applied to a target domain. Important references are also provided, including work on detecting change in data streams and on domain adaptation with structural correspondence learning.