FROSTER is an effective framework for open-vocabulary action recognition that leverages the strong generalization capability of CLIP. Pretrained on massive image-text pairs, CLIP has shown success in image-based tasks, but it captures no temporal information, which makes direct transfer to action recognition challenging. FROSTER addresses this with residual feature distillation: a frozen CLIP model serves as a teacher that supervises the tuned model's features, preserving generalization while the model adapts to video. A residual sub-network mediates between the two objectives, balancing generalizable feature learning against video-specific feature learning, so the model can adapt flexibly from images to videos without sacrificing the generalization capability of the pretrained CLIP.

Evaluated on open-vocabulary benchmarks under both base-to-novel and cross-dataset settings, FROSTER consistently achieves state-of-the-art performance, outperforming existing methods and demonstrating its potential for open-world applications. The framework is also compatible with various network architectures. The key contributions are the FROSTER framework itself, the residual feature distillation approach that balances generalizable and video-specific feature learning, and extensive evaluations demonstrating its effectiveness in improving action recognition performance.
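The residual feature distillation described above can be sketched as follows. This is a minimal illustrative example, not the official FROSTER implementation: the feature dimension, the two-layer residual sub-network, and the L2 distillation loss are simplifying assumptions, and all names here are hypothetical.

```python
# Hypothetical sketch of FROSTER-style residual feature distillation
# (shapes, weights, and the loss form are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # feature dimension (real CLIP features are e.g. 512-d)

# Feature from the frozen CLIP teacher and from the tuned student,
# for the same video clip.
teacher_feat = rng.normal(size=dim)
student_feat = rng.normal(size=dim)

# Residual sub-network: a small two-layer projection whose output is
# added back onto the student feature via an identity shortcut, so
# distillation constrains a residual correction rather than forcing
# the full feature to match the teacher.
W1 = rng.normal(size=(dim, dim)) * 0.01
W2 = rng.normal(size=(dim, dim)) * 0.01

def residual_adapt(f):
    h = np.maximum(W1 @ f, 0.0)  # ReLU hidden layer
    return f + W2 @ h            # identity shortcut + learned residual

def distill_loss(student, teacher):
    # L2 distance pulls the adapted student feature toward the frozen
    # teacher, preserving CLIP's generalizable knowledge while the
    # student is free to learn video-specific cues elsewhere.
    return float(np.sum((residual_adapt(student) - teacher) ** 2))

loss = distill_loss(student_feat, teacher_feat)
print(loss >= 0.0)
```

In training, this distillation term would be added to the classification objective, letting the residual branch absorb video-specific adaptation while the shortcut keeps the student close to the frozen teacher.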