This paper proposes Multi-MaP, a novel multi-modal proxy learning method for personalized visual multiple clustering. The work is motivated by two observations: multiple clustering algorithms typically generate every plausible clustering of a dataset even though a user usually needs only one of them, and a user's interest, often expressed as a short keyword, must be aligned with the corresponding visual components. By building on multi-modal models such as CLIP, Multi-MaP bridges this gap between user interests and visual data.

The method leverages CLIP encoders to extract coherent text and image embeddings and integrates GPT-4 to formulate effective textual contexts based on the user's interest. It takes as input both text prompts and the unlabeled images of the clustering task, and learns an optimal text proxy for each image under a reference word constraint and a concept-level constraint, yielding personalized representations. In this way, Multi-MaP not only captures a user's interest via a keyword but also identifies the clustering relevant to that interest.

The contributions are fourfold: it is the first deep multiple clustering method that precisely captures a user's interest(s) and generates the corresponding personalized clustering(s); it proposes a novel multi-modal proxy learning method; it theoretically shows that a close reference token helps constrain the proxy search; and it demonstrates that CLIP can uncover different semantic aspects of the same images. Extensive experiments on benchmark multiple clustering vision datasets show that Multi-MaP consistently outperforms state-of-the-art methods.
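To make the proxy-learning idea concrete, the sketch below illustrates one plausible reading of the objective: a learnable proxy per image is pulled toward its CLIP image embedding, toward the embedding of the user's reference word, and toward its nearest candidate concept. This is a minimal sketch, not the paper's implementation; it assumes the proxies are optimized directly in CLIP's joint embedding space (a simplification of token-level proxies), the candidate concept list stands in for GPT-4 output, and the prompt template, function names, and loss weights `ref_weight` / `concept_weight` are illustrative.

```python
# Minimal sketch of multi-modal proxy learning in the spirit of Multi-MaP.
# Assumptions: proxies live directly in CLIP's joint embedding space, the
# candidate concepts are a hand-supplied list standing in for GPT-4 output,
# and the prompt template and loss weights are illustrative, not the paper's.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def encode_texts(texts):
    """Frozen, L2-normalized CLIP text embeddings."""
    inputs = processor(text=texts, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return F.normalize(feats, dim=-1)


def encode_images(images):
    """Frozen, L2-normalized CLIP image embeddings."""
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)


def learn_proxies(images, reference_word, candidate_concepts,
                  steps=200, lr=1e-2, ref_weight=0.5, concept_weight=0.5):
    """Learn one text proxy per image under the user's interest keyword."""
    img_emb = encode_images(images)                     # (N, d)
    ref_emb = encode_texts([reference_word])            # (1, d)
    concept_emb = encode_texts(
        [f"a photo of a {c} object" for c in candidate_concepts]
    )                                                   # (K, d)

    proxies = img_emb.clone().requires_grad_(True)      # init from image embeddings
    optimizer = torch.optim.Adam([proxies], lr=lr)

    for _ in range(steps):
        p = F.normalize(proxies, dim=-1)
        align_loss = (1 - (p * img_emb).sum(-1)).mean()          # match the image
        ref_loss = (1 - (p @ ref_emb.T).squeeze(-1)).mean()      # reference word constraint
        nearest = (p @ concept_emb.T).max(dim=-1).values         # concept-level constraint
        concept_loss = (1 - nearest).mean()
        loss = align_loss + ref_weight * ref_loss + concept_weight * concept_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    p = F.normalize(proxies.detach(), dim=-1)
    # Personalized clustering: assign each image to its best-matching concept.
    labels = (p @ concept_emb.T).argmax(dim=-1)
    return labels, p


# Hypothetical usage: cluster fruit images by the interest "color".
# images = [Image.open(path) for path in ["apple.jpg", "lemon.jpg"]]
# labels, proxies = learn_proxies(images, "color", ["red", "yellow", "green"])
```

Initializing the proxies from the image embeddings and reading off the clustering as each proxy's nearest candidate concept are design choices made here for brevity; the key point the sketch conveys is how the reference word and the concept-level candidates jointly constrain where the proxy can move in CLIP's embedding space.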