Cantor is a novel multimodal chain-of-thought (CoT) framework designed to enhance the reasoning capabilities of large language models (LLMs) on visual reasoning tasks. By integrating visual context with logical reasoning, it improves decision-making accuracy and reduces hallucination.

Cantor introduces a perception-decision architecture in which a multimodal LLM (MLLM) acts as a decision generator: it analyzes the visual and textual inputs to form a comprehensive understanding of the problem, then assigns specific sub-tasks to expert modules specialized in extracting and interpreting visual and textual information. These experts, the TextIntel Extractor, ObjectQuant Locator, VisionIQ Analyst, and ChartSense Expert, each handle a different aspect of the reasoning process, and the decision generator issues explicit instructions to each of them, keeping the reasoning structured and accurate. During execution, the MLLM processes high-level visual information directly, reducing the need for external tools and making the reasoning process more efficient.
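To make the two-stage flow concrete, the following is a minimal Python sketch of how a decision-generation call could feed role-played expert calls. It is a hypothetical outline, not the authors' implementation: the call_mllm helper, the prompt wording, and the "Expert: instruction" line format are assumptions for illustration; only the two-stage structure and the four expert names come from the framework description above.

```python
"""Minimal sketch of Cantor's decision-then-execution flow.

Hypothetical outline: `call_mllm`, the prompts, and the plan format
are illustrative assumptions, not the authors' actual implementation.
"""

EXPERTS = [
    "TextIntel Extractor",   # extracts text embedded in the image
    "ObjectQuant Locator",   # locates and counts objects
    "VisionIQ Analyst",      # answers open-ended visual questions
    "ChartSense Expert",     # reads charts, plots, and diagrams
]


def call_mllm(prompt: str, image=None) -> str:
    """Placeholder for a call to whatever multimodal model is available."""
    raise NotImplementedError("wire this to a real MLLM API")


def decision_generation(question: str, image) -> dict[str, str]:
    """Stage 1: the MLLM analyzes the image and question, then writes
    an explicit sub-task instruction for each expert it decides to use."""
    prompt = (
        f"Question: {question}\n"
        f"Available experts: {', '.join(EXPERTS)}\n"
        "Analyze the image and question, then assign each needed expert "
        "a concrete sub-task, one per line as 'Expert: instruction'."
    )
    plan = call_mllm(prompt, image=image)
    tasks = {}
    for line in plan.splitlines():
        if ":" in line:
            expert, instruction = line.split(":", 1)
            if expert.strip() in EXPERTS:
                tasks[expert.strip()] = instruction.strip()
    return tasks


def execution(question: str, image, tasks: dict[str, str]) -> str:
    """Stage 2: the same MLLM plays each expert role itself (no external
    tools), then synthesizes the sub-answers into a final answer."""
    findings = []
    for expert, instruction in tasks.items():
        role_prompt = f"You are the {expert}. {instruction}"
        findings.append(f"{expert}: {call_mllm(role_prompt, image=image)}")
    synthesis_prompt = (
        f"Question: {question}\n"
        "Expert findings:\n" + "\n".join(findings) + "\n"
        "Reason step by step over these findings and give the final answer."
    )
    return call_mllm(synthesis_prompt, image=image)
```

Note that in this reading, both stages are served by the same MLLM under different prompts, which is what lets the framework avoid external tools during execution.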
Cantor has been evaluated on two visual reasoning benchmarks, ScienceQA and MathVista, where it improves accuracy significantly over existing methods and achieves state-of-the-art results on both without fine-tuning or ground-truth rationales. Its effectiveness stems from coupling visual information with logical reasoning, which enables more accurate and comprehensive decision-making, and from the expert modules, which let the MLLM handle complex visual reasoning tasks more effectively, demonstrating the potential of multimodal CoT for enhancing the reasoning capabilities of LLMs.