The paper "Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts" addresses the challenge of steering a strong model pre-trained on internet-scale data, which can be difficult due to the scarcity of competent supervisors. The authors propose a method that leverages a diverse set of specialized teachers, collectively supervising a strong student model. This approach, inspired by the classical hierarchical mixture-of-experts model, involves two main components:
1. **Teacher Assignment**: Training alternates between updating the student and re-assigning teachers, using the student's evolving competence to identify the most suitable weak supervisors. This resembles an Expectation-Maximization (EM) algorithm in which the student's latest iteration serves as a proxy for ground-truth annotations (a minimal sketch of this loop follows the list).
2. **Noise Reduction**: The method enforces teacher-student consistency as well as local-global consistency to reduce annotation noise; this conservative criterion rejects potentially misleading annotations rather than training on them (see the filtering sketch after the list).
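To make the alternating procedure concrete, here is a minimal numpy sketch of the EM-style loop. The helper names (`assign_teachers`, `train_student`) and the cross-entropy agreement criterion are illustrative assumptions, not the paper's exact formulation; in particular, the M-step below merely stands in for gradient training of the strong model.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(p, q, eps=1e-9):
    # Per-example cross-entropy between label distributions p and q.
    return -(p * np.log(q + eps)).sum(axis=-1)

def assign_teachers(teacher_probs, student_probs):
    # E-step: per example, pick the teacher whose annotation the current
    # student agrees with most (lowest cross-entropy against the student).
    # teacher_probs: (T, N, C); student_probs: (N, C).
    disagreement = np.stack([cross_entropy(t, student_probs)
                             for t in teacher_probs])         # (T, N)
    return disagreement.argmin(axis=0)                        # (N,)

def train_student(labels, n_classes, smooth=0.9):
    # M-step stand-in: this "student" just memorizes smoothed one-hot
    # labels; in practice this is gradient training of the strong model.
    probs = np.full((len(labels), n_classes),
                    (1.0 - smooth) / (n_classes - 1))
    probs[np.arange(len(labels)), labels] = smooth
    return probs

# Toy setup: 3 weak teachers, 100 examples, 5 classes.
T, N, C = 3, 100, 5
teacher_probs = rng.dirichlet(np.ones(C), size=(T, N))        # (T, N, C)
student_probs = teacher_probs.mean(axis=0)                    # neutral init

for step in range(5):
    assignment = assign_teachers(teacher_probs, student_probs)    # E-step
    labels = teacher_probs[assignment, np.arange(N)].argmax(-1)   # chosen labels
    student_probs = train_student(labels, C)                      # M-step
```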
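The noise-reduction step can be sketched in the same spirit. Below, an annotation is kept only when teacher and student agree on the predicted class, and when the teacher label matches the majority teacher label among the example's nearest neighbors; the argmax-agreement test and the k-NN stand-in for local-global consistency are assumptions, not the paper's exact criteria.

```python
import numpy as np

def consistent_mask(teacher_probs, student_probs, features, k=5):
    # Return a boolean mask over examples whose annotations to trust.
    t_lab = teacher_probs.argmax(-1)
    s_lab = student_probs.argmax(-1)
    ts_ok = t_lab == s_lab                     # teacher-student consistency

    # Local-global consistency proxy: the teacher label should match the
    # majority teacher label among the example's k nearest neighbors.
    d = ((features[:, None] - features[None]) ** 2).sum(-1)   # pairwise sq. dist
    np.fill_diagonal(d, np.inf)                               # exclude self
    nbrs = d.argsort(axis=1)[:, :k]                           # (N, k) neighbors
    nbr_labels = t_lab[nbrs]
    majority = np.array([np.bincount(row).argmax() for row in nbr_labels])
    lg_ok = t_lab == majority

    return ts_ok & lg_ok

# Example: filter toy annotations before the student's next update.
rng = np.random.default_rng(1)
N, C = 100, 5
feats = rng.normal(size=(N, 8))
tp = rng.dirichlet(np.ones(C), size=N)
sp = rng.dirichlet(np.ones(C), size=N)
mask = consistent_mask(tp, sp, feats)
print(f"kept {mask.sum()} / {N} annotations")
```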
The effectiveness of the proposed method is validated through experiments on visual recognition tasks from the OpenAI weak-to-strong benchmark and on additional multi-domain datasets. The method outperforms the vanilla single-teacher baseline by over 15% on the OpenAI benchmark and yields consistent improvements in the multi-domain scenarios. The authors hope that their findings will contribute to the fields of co-supervised learning and superalignment.