BM25S: Orders of magnitude faster lexical search via eager sparse scoring

4 Jul 2024 | Xing Han Lù
BM25S is an efficient Python implementation of BM25 that depends only on NumPy and SciPy. By precomputing BM25 scores during indexing and storing them in sparse matrices, it achieves up to a 500x speedup compared to other Python-based frameworks, and it also outperforms Java-based implementations used in commercial products. BM25S reproduces five BM25 variants by extending eager scoring with a novel score-shifting method. The code is available at https://github.com/xhluca/bm25s.

BM25S improves upon previous work by simplifying the implementation and generalizing it to other BM25 variants. Unlike BM25-PT, it uses SciPy's sparse matrix implementation rather than PyTorch: at query time it slices the relevant token indices and sums across the token dimension, avoiding matrix multiplications entirely. It also includes a fast Python tokenizer that combines Scikit-Learn's text splitting, Elastic's stopword list, and, optionally, a C-based Snowball stemmer; this achieves better retrieval performance than the subword tokenizers used by BM25-PT.

The implementation follows the study by Kamphuis et al. (2020). BM25 scores depend on a term-frequency (TF) component and an inverse document frequency (IDF) component; by reformulating the scoring function, BM25S can eagerly compute both during indexing. The precomputed scores are stored in a sparse matrix in Compressed Sparse Column (CSC) format for efficient storage and column slicing. Tokenization splits text with a regular-expression pattern, with optional stemming and stopword removal. Top-k selection uses a partial sort with average O(n) time complexity, implemented in NumPy or JAX.

In experiments, BM25S outperforms Rank-BM25 in throughput, achieving over 100x higher throughput on 10 out of 14 datasets. Ablations show that combining stopword removal with stemming modestly improves retrieval performance. BM25S is also compared against other systems, including Elasticsearch and Pyserini.
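The eager-scoring idea can be sketched as follows. This is not the actual BM25S code: the function names, the Lucene-style IDF, and the use of integer token ids are illustrative assumptions. The key point is that all per-(document, token) BM25 contributions are computed once at index time and stored in a CSC sparse matrix, so a query reduces to slicing the query-token columns and summing.

```python
import numpy as np
from scipy import sparse

def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute the BM25 contribution of every (doc, token) pair
    at index time. corpus_tokens: list of integer token-id lists, one
    per document. (Illustrative sketch, not the BM25S library API.)"""
    n_docs = len(corpus_tokens)
    vocab_size = max(t for doc in corpus_tokens for t in doc) + 1
    doc_lens = np.array([len(d) for d in corpus_tokens], dtype=np.float64)
    avgdl = doc_lens.mean()

    # Document frequency per token, then a Lucene-style IDF (one of
    # several variants; chosen here only for concreteness).
    df = np.zeros(vocab_size)
    for doc in corpus_tokens:
        for t in set(doc):
            df[t] += 1
    idf = np.log(1 + (n_docs - df + 0.5) / (df + 0.5))

    rows, cols, vals = [], [], []
    for i, doc in enumerate(corpus_tokens):
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        norm = k1 * (1 - b + b * doc_lens[i] / avgdl)
        for t, f in tf.items():
            score = idf[t] * f * (k1 + 1) / (f + norm)
            rows.append(i); cols.append(t); vals.append(score)
    # CSC format makes slicing whole token columns cheap at query time.
    return sparse.csc_matrix((vals, (rows, cols)),
                             shape=(n_docs, vocab_size))

def retrieve(index, query_tokens):
    # Slice the query-token columns and sum across the token dimension;
    # no matrix multiplication is needed.
    return np.asarray(index[:, query_tokens].sum(axis=1)).ravel()
```

Because every stored entry is already a final BM25 contribution, scoring a query touches only the nonzero entries in a handful of columns, which is what makes the approach fast.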
It is efficient, fast, and has minimal dependencies, making it suitable for edge deployments and for use in the browser via WebAssembly frameworks. It complements existing implementations and provides accurate, sparse BM25 scoring. However, it may not achieve the highest possible performance, as the implementation prioritizes readability and extensibility. The experiments are conducted on free hardware for reproducibility.
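The average O(n) top-k selection mentioned above can be sketched in NumPy with `np.argpartition`, which uses introselect to place the k largest scores at the end of the array without fully sorting it; only those k entries are then sorted, for O(k log k) extra work. The function name here is illustrative, not the BM25S API.

```python
import numpy as np

def topk(scores, k):
    """Return (indices, scores) of the k highest-scoring documents,
    in descending order, in average O(n) time. (Illustrative sketch.)"""
    k = min(k, len(scores))
    part = np.argpartition(scores, -k)[-k:]   # unordered top-k indices, O(n)
    order = np.argsort(scores[part])[::-1]    # sort only those k, O(k log k)
    idx = part[order]
    return idx, scores[idx]
```

For small k relative to the corpus size, this avoids the O(n log n) cost of sorting every document's score, which matters when scoring millions of documents per query.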