Weak-to-Strong Search is a method for aligning large language models (LLMs) by using small language models as test-time guidance. It frames alignment as a test-time greedy search that maximizes the log-probability difference between a small tuned model and its untuned counterpart while sampling from the frozen large model, enabling model up-scaling without directly fine-tuning the large model. The method serves two purposes: (1) a compute-efficient strategy for model up-scaling, and (2) an instance of weak-to-strong generalization, in which a strong model is improved by weak test-time guidance.
The algorithm introduces Chunk-level Beam Search (CBS), a beam search variant that balances reward maximization against KL minimization and applies to both white-box and black-box large models. Empirically, weak-to-strong search improves alignment across diverse tasks, including controlled-sentiment generation, summarization, and instruction-following benchmarks. On AlpacaEval 2.0, it significantly raises the win rates of large models against gpt-4-turbo, even when the guiding small models themselves have low win rates. Because it reuses existing small tuned and untuned models as the steering force, the method avoids training reward or value models from scratch, remains flexible and reusable across alignment scenarios, and generalizes well even on challenging tasks.
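The chunk-level search described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: the function names, the chunk granularity, and the toy scoring models in the demo are assumptions. The key idea it shows is scoring each candidate continuation by the log-probability gap between a small tuned model and its untuned counterpart, while the chunks themselves come from the (frozen) large model.

```python
from itertools import cycle

def chunk_level_beam_search(prompt, sample_chunk, logp_tuned, logp_untuned,
                            beam_width=2, samples_per_state=2, num_chunks=3):
    """Sketch of Chunk-level Beam Search (CBS).

    Keeps the `beam_width` partial continuations whose reward -- the
    log-probability difference between a small tuned model and its untuned
    counterpart -- is highest, expanding chunk by chunk. `sample_chunk`
    stands in for sampling a short continuation from the large model.
    """
    beams = [""]  # partial continuations of the prompt
    for _ in range(num_chunks):
        candidates = []
        for partial in beams:
            for _ in range(samples_per_state):
                chunk = sample_chunk(prompt + partial)  # from the large model
                cont = partial + chunk
                # reward: how much more the tuned small model likes this text
                score = logp_tuned(prompt + cont) - logp_untuned(prompt + cont)
                candidates.append((score, cont))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [cont for _, cont in candidates[:beam_width]]
    return beams[0]

# Toy demo (hypothetical models): the "large model" cycles through fixed
# chunks, and the "tuned" small model simply favors the word "great".
chunks = cycle([" great", " bad", " okay"])
sample = lambda text: next(chunks)
lp_tuned = lambda text: float(text.count("great"))  # stand-in log-prob
lp_untuned = lambda text: 0.0                       # uniform baseline
best = chunk_level_beam_search("The movie was", sample, lp_tuned, lp_untuned)
print(repr(best))  # the search steers generation toward "great"
```

In a real setting, `logp_tuned` and `logp_untuned` would sum per-token log-probabilities from the small tuned/untuned model pair, and `sample_chunk` would decode a fixed number of tokens from the large model; only sequence-level log-probabilities of the small models are needed, which is what makes the guidance applicable to black-box large models.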