[slides] Efficient Similarity Search In Sequence Databases

This paper proposes an efficient method for similarity search in sequence databases using the Discrete Fourier Transform (DFT) and R*-trees. The key idea is to map time sequences to the frequency domain using DFT, where only the first few frequencies are significant. Parseval's theorem ensures that the Euclidean distance in the time domain is preserved in the frequency domain. By using only the first few DFT coefficients, the sequences are mapped to a lower-dimensional space, allowing efficient indexing with R*-trees. This approach reduces the dimensionality and avoids the "curse of dimensionality," while introducing a small number of false hits that can be filtered out in post-processing. The method is tested against sequential scanning and shows superior performance, especially with a small number of DFT coefficients (1-3). The performance gain increases with the number and length of sequences. The technique is applicable to various real-world sequences, such as stock price movements, exchange rates, and musical scores, which exhibit energy concentrated in the first few frequencies. The method is robust for medium-dimensional data and can be adapted to other similarity measures and distance-preserving transforms. The experiments demonstrate that the proposed F-index method achieves better performance as the volume of data increases, making it suitable for large databases. The paper also discusses future work, including extensions to higher-dimensional signals and other orthonormal transforms.This paper proposes an efficient method for similarity search in sequence databases using the Discrete Fourier Transform (DFT) and R*-trees. The key idea is to map time sequences to the frequency domain using DFT, where only the first few frequencies are significant. Parseval's theorem ensures that the Euclidean distance in the time domain is preserved in the frequency domain. By using only the first few DFT coefficients, the sequences are mapped to a lower-dimensional space, allowing efficient indexing with R*-trees. This approach reduces the dimensionality and avoids the "curse of dimensionality," while introducing a small number of false hits that can be filtered out in post-processing. The method is tested against sequential scanning and shows superior performance, especially with a small number of DFT coefficients (1-3). The performance gain increases with the number and length of sequences. The technique is applicable to various real-world sequences, such as stock price movements, exchange rates, and musical scores, which exhibit energy concentrated in the first few frequencies. The method is robust for medium-dimensional data and can be adapted to other similarity measures and distance-preserving transforms. The experiments demonstrate that the proposed F-index method achieves better performance as the volume of data increases, making it suitable for large databases. The paper also discusses future work, including extensions to higher-dimensional signals and other orthonormal transforms.

Efficient Similarity Search In Sequence Databases

| Rakesh Agrawal, Christos Faloutsos*, Arun Swami