Volume 13 / Number 7 / July, 1970 | BURTON H. BLOOM
This paper by Burton H. Bloom analyzes trade-offs among computational factors in hash coding, particularly focusing on the paradigm problem of testing a series of messages for membership in a given set. Two new hash-coding methods are introduced and compared with a conventional method. The new methods aim to reduce the space required for hash-coded information by tolerating a small fraction of errors, which is particularly useful in applications with large datasets where a core-resident hash area is infeasible.
The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember (reject time), and the allowable error frequency. The paper demonstrates that allowing a small number of false positives can significantly reduce the hash area size without increasing reject time. Two new methods are proposed: Method 1 uses smaller cells containing codes derived from messages, and Method 2 treats the hash area as individual bits, setting bits corresponding to hash addresses to 1.
The analysis shows that Method 2 is more efficient than Method 1, especially when the allowable error frequency is low. The paper also provides a detailed example of applying these methods to the problem of automatic hyphenation, where the new methods can reduce the hash area size from 2,000,000 bits to less than 300,000 bits, significantly reducing disk accesses.This paper by Burton H. Bloom analyzes trade-offs among computational factors in hash coding, particularly focusing on the paradigm problem of testing a series of messages for membership in a given set. Two new hash-coding methods are introduced and compared with a conventional method. The new methods aim to reduce the space required for hash-coded information by tolerating a small fraction of errors, which is particularly useful in applications with large datasets where a core-resident hash area is infeasible.
The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember (reject time), and the allowable error frequency. The paper demonstrates that allowing a small number of false positives can significantly reduce the hash area size without increasing reject time. Two new methods are proposed: Method 1 uses smaller cells containing codes derived from messages, and Method 2 treats the hash area as individual bits, setting bits corresponding to hash addresses to 1.
The analysis shows that Method 2 is more efficient than Method 1, especially when the allowable error frequency is low. The paper also provides a detailed example of applying these methods to the problem of automatic hyphenation, where the new methods can reduce the hash area size from 2,000,000 bits to less than 300,000 bits, significantly reducing disk accesses.