This paper introduces Bootstrapped Preference Optimization (BPO), a method to enhance the visual grounding of Multimodal Large Language Models (MLLMs) by mitigating their bias towards pretraining statistics. The key idea is to apply preference learning on datasets whose negative responses are generated by the model itself. Two strategies are proposed for bootstrapping these negatives: 1) feeding distorted image inputs to elicit responses dominated by pretraining bias, and 2) using a text-based LLM to inject erroneous but common elements into the original responses. The negative responses are paired with the annotated responses to form a preference dataset, which is then used for preference learning. This suppresses the pretraining bias and yields better alignment with the visual input. Extensive experiments show significant improvements across multiple benchmarks, advancing the state of the art in multimodal conversational systems. The paper also reviews related work on alignment techniques for large language models and on hallucination mitigation in MLLMs. BPO is compared against SFT and DPO baselines, outperforming both in visual truthfulness, helpfulness, and sample efficiency. The paper concludes that BPO is a promising approach for improving the visual grounding of MLLMs.
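
The two bootstrapping strategies and the subsequent preference step can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: the `mllm.generate` and `llm.generate` interfaces, the Gaussian-blur distortion, and the error-injection prompt are hypothetical placeholders, and the loss shown is the standard DPO objective that preference learning of this kind typically builds on.

```python
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur


def build_preference_pairs(mllm, llm, image, prompt, annotated_response):
    """Construct (chosen, rejected) pairs following the two bootstrapping strategies.
    `mllm` and `llm` with a .generate() interface are hypothetical placeholders."""
    # Strategy 1: distort the image (heavy blur used here only as an illustrative
    # distortion) so the MLLM falls back on pretraining statistics rather than
    # the visual evidence, producing a biased negative response.
    distorted = GaussianBlur(kernel_size=31, sigma=10.0)(image)
    rejected_from_distortion = mllm.generate(image=distorted, prompt=prompt)

    # Strategy 2: ask a text-only LLM to inject plausible-but-wrong elements
    # into the annotated response (instruction wording is illustrative).
    rejected_from_injection = llm.generate(
        "Rewrite the answer, adding a few common but incorrect details:\n"
        + annotated_response
    )

    chosen = annotated_response  # the human-annotated response is the positive
    return [(chosen, rejected_from_distortion), (chosen, rejected_from_injection)]


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective applied to the bootstrapped pairs:
    -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                         - (log pi(y_l) - log pi_ref(y_l))])."""
    margin = (policy_logp_chosen - ref_logp_chosen) \
        - (policy_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

In this sketch, both kinds of rejected responses share the same chosen (annotated) response, so each annotated example yields two preference pairs; the loss then pushes the policy away from the self-generated, bias-revealing negatives and towards the visually grounded positives.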