BM25S: Orders of magnitude faster lexical search via eager sparse scoring

4 Jul 2024 | Xing Han Lù
BM25S is an efficient Python implementation of BM25 that depends only on NumPy and SciPy. By precomputing BM25 scores during indexing and storing them in sparse matrices, it achieves up to a 500x speedup compared to other Python-based frameworks, and it also outperforms Java-based implementations used in commercial products. BM25S reproduces five BM25 variants by extending eager scoring with a novel score-shifting method. The code is available at https://github.com/xhluca/bm25s.

BM25S improves upon previous work by simplifying the implementation and generalizing it to other BM25 variants. Unlike BM25-PT, it uses SciPy's sparse matrix implementation rather than PyTorch: at query time it slices the relevant token indices and sums across the token dimension, avoiding matrix multiplications entirely. It also includes a fast Python tokenizer that combines Scikit-Learn's text splitting, Elastic's stopword list, and, optionally, a C-based Snowball stemmer; this achieves better retrieval performance than the subword tokenizers used by BM25-PT.

The implementation follows the study by Kamphuis et al. (2020). BM25 scores depend on a term-frequency (TF) component and an inverse document frequency (IDF) component; by reformulating the scoring function, BM25S can eagerly compute both during indexing. The precomputed scores are stored in a sparse matrix in Compressed Sparse Column (CSC) format for efficient storage and column slicing. Tokenization splits text with a regular-expression pattern, with optional stemming and stopword removal. Top-k selection uses a partial sort with average O(n) time complexity, implemented in NumPy or JAX.

In experiments, BM25S outperforms Rank-BM25 in throughput, achieving over 100x higher throughput on 10 out of 14 datasets. Ablations show that combining stopword removal with stemming modestly improves retrieval performance. BM25S is also compared against other systems, including Elasticsearch and Pyserini.
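The eager-scoring idea can be sketched as follows. This is not the actual BM25S code: the function names, the Lucene-style IDF, and the use of integer token ids are illustrative assumptions. The key point is that all per-(document, token) BM25 contributions are computed once at index time and stored in a CSC sparse matrix, so a query reduces to slicing the query-token columns and summing.

```python
import numpy as np
from scipy import sparse

def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute the BM25 contribution of every (doc, token) pair
    at index time. corpus_tokens: list of integer token-id lists, one
    per document. (Illustrative sketch, not the BM25S library API.)"""
    n_docs = len(corpus_tokens)
    vocab_size = max(t for doc in corpus_tokens for t in doc) + 1
    doc_lens = np.array([len(d) for d in corpus_tokens], dtype=np.float64)
    avgdl = doc_lens.mean()

    # Document frequency per token, then a Lucene-style IDF (one of
    # several variants; chosen here only for concreteness).
    df = np.zeros(vocab_size)
    for doc in corpus_tokens:
        for t in set(doc):
            df[t] += 1
    idf = np.log(1 + (n_docs - df + 0.5) / (df + 0.5))

    rows, cols, vals = [], [], []
    for i, doc in enumerate(corpus_tokens):
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        norm = k1 * (1 - b + b * doc_lens[i] / avgdl)
        for t, f in tf.items():
            score = idf[t] * f * (k1 + 1) / (f + norm)
            rows.append(i); cols.append(t); vals.append(score)
    # CSC format makes slicing whole token columns cheap at query time.
    return sparse.csc_matrix((vals, (rows, cols)),
                             shape=(n_docs, vocab_size))

def retrieve(index, query_tokens):
    # Slice the query-token columns and sum across the token dimension;
    # no matrix multiplication is needed.
    return np.asarray(index[:, query_tokens].sum(axis=1)).ravel()
```

Because every stored entry is already a final BM25 contribution, scoring a query touches only the nonzero entries in a handful of columns, which is what makes the approach fast.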
It is efficient, fast, and has minimal dependencies, making it suitable for edge deployments and for use in the browser via WebAssembly frameworks. It complements existing implementations and provides accurate, sparse BM25 scoring. However, it may not achieve the highest possible performance, as the implementation prioritizes readability and extensibility. The experiments are conducted on free hardware for reproducibility.
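The average O(n) top-k selection mentioned above can be sketched in NumPy with `np.argpartition`, which uses introselect to place the k largest scores at the end of the array without fully sorting it; only those k entries are then sorted, for O(k log k) extra work. The function name here is illustrative, not the BM25S API.

```python
import numpy as np

def topk(scores, k):
    """Return (indices, scores) of the k highest-scoring documents,
    in descending order, in average O(n) time. (Illustrative sketch.)"""
    k = min(k, len(scores))
    part = np.argpartition(scores, -k)[-k:]   # unordered top-k indices, O(n)
    order = np.argsort(scores[part])[::-1]    # sort only those k, O(k log k)
    idx = part[order]
    return idx, scores[idx]
```

For small k relative to the corpus size, this avoids the O(n log n) cost of sorting every document's score, which matters when scoring millions of documents per query.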