This paper investigates whether self-supervised speech models (S3Ms) encode predominantly phonetic or semantic properties. The authors curate a dataset of near-homophone and synonym word pairs and measure similarities between S3M word representations. Their results show that S3M representations consistently exhibit more phonetic than semantic similarity. The authors also question whether widely used intent classification datasets, such as Fluent Speech Commands and Snips SmartLights, are adequate for measuring semantic abilities: a simple baseline using only word identity outperforms S3M-based models, suggesting that high scores on these benchmarks do not necessarily indicate semantic content.
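To make the probe concrete, here is a minimal sketch of the pair-similarity measurement, not the authors' exact pipeline: hidden states from one HuBERT layer are mean-pooled over a spoken word and compared by cosine similarity. The pooling choice, the layer index, and the audio file names are assumptions for illustration.

```python
# Minimal sketch: compare pooled S3M representations of two spoken words.
# Assumptions (not from the paper): mean pooling over frames, cosine
# similarity, and the hypothetical .wav file names below.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def word_embedding(path: str, layer: int = 6) -> torch.Tensor:
    """Mean-pool one layer's hidden states over a single spoken word."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0).mean(dim=0)

# Hypothetical recordings of a near-homophone pair and a synonym pair.
sim = torch.nn.functional.cosine_similarity
phonetic = sim(word_embedding("flour.wav"), word_embedding("flower.wav"), dim=0)
semantic = sim(word_embedding("happy.wav"), word_embedding("glad.wav"), dim=0)
print(f"near-homophone: {phonetic:.3f}  synonym: {semantic:.3f}")
```

The paper's finding corresponds to the first score being consistently higher than the second across many such pairs.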
The study analyzes several S3Ms, including wav2vec 2.0 (Base and Large), HuBERT (Base and Large), XLS-R-300M, and WavLM-Large. Across all layers and all models, phonetic similarity is more prominent. A cross-lingual analysis shows the same trend, with near-homophones remaining more similar than synonyms. The study also controls for speaker variability and finds that the results are consistent across different speakers.
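The layer-wise claim can be checked with a small extension of the sketch above (reusing its `word_embedding` and `sim` definitions): sweep over every transformer layer and print both similarity scores per layer. Which layers the authors report is not specified here, so this is an assumed probing procedure.

```python
# Layer-wise profile, continuing the previous sketch: index 0 is the
# CNN-feature embedding, indices 1..N are the transformer layers.
for layer in range(model.config.num_hidden_layers + 1):
    p = sim(word_embedding("flour.wav", layer),
            word_embedding("flower.wav", layer), dim=0)
    s = sim(word_embedding("happy.wav", layer),
            word_embedding("glad.wav", layer), dim=0)
    print(f"layer {layer:2d}: near-homophone {p:.3f}  synonym {s:.3f}")
```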
The authors also evaluate S3Ms on intent classification tasks. A simple bag-of-words baseline built on word identity outperforms the S3M-based models, suggesting that strong performance on these tasks can be achieved without any semantic understanding. The study concludes that S3M representations are more phonetic than semantic, that high intent classification scores do not necessarily indicate semantic capabilities, and that further research is needed to probe the semantic content of S3Ms.
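For reference, a word-identity baseline of this kind can be as simple as the following sketch: binary bag-of-words features over transcripts fed to a linear classifier. The transcripts, labels, and classifier choice are hypothetical stand-ins, not the authors' exact setup, but they illustrate why intent labels that are predictable from word identity alone say little about semantics.

```python
# Toy word-identity baseline for intent classification. The utterances and
# intent labels below are hypothetical Fluent Speech Commands-style examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

transcripts = [
    "turn on the lights in the kitchen",
    "switch off the bedroom lamp",
    "increase the volume",
    "turn the heat down",
]
intents = ["activate", "deactivate", "volume_up", "decrease_heat"]

# Binary bag-of-words -> linear classifier: word identity is the only signal.
baseline = make_pipeline(CountVectorizer(binary=True),
                         LogisticRegression(max_iter=1000))
baseline.fit(transcripts, intents)
print(baseline.predict(["turn off the kitchen lights"]))
```

If such a model matches or beats S3M features on a benchmark, the benchmark is measuring word recognition rather than semantic ability, which is exactly the paper's point.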