This paper investigates the linguistic properties encoded by self-supervised speech models (S3Ms) and finds that their word representations consistently capture more phonetic than semantic similarity. The authors curate a dataset of near-homophone and synonym word pairs and measure the similarities between the corresponding S3M word representations. Their findings show that phonetically similar word pairs are significantly closer in S3M representation space across all layers, while semantically similar word pairs are only slightly closer. This suggests that S3Ms encode phonetic information much more strongly than semantic information.
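To make the measurement concrete, here is a minimal sketch of how such pairwise similarities can be computed, assuming mean-pooled frame representations from a HuBERT-style model loaded through Hugging Face transformers. The extract_word_embedding helper, the layer choice, and the pooling strategy are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: cosine similarity between two spoken-word representations.
# Assumes word-aligned audio segments; helper names are hypothetical.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoFeatureExtractor

MODEL_NAME = "facebook/hubert-base-ls960"
model = AutoModel.from_pretrained(MODEL_NAME)
extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)

def extract_word_embedding(waveform, sample_rate, layer):
    """Mean-pool one layer's frame representations over a spoken word.

    waveform: 1-D float array containing a single word segment.
    """
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0)  # (hidden_dim,)

def pair_similarity(wave_a, wave_b, sample_rate=16000, layer=6):
    """Cosine similarity between the representations of two word segments."""
    emb_a = extract_word_embedding(wave_a, sample_rate, layer)
    emb_b = extract_word_embedding(wave_b, sample_rate, layer)
    return F.cosine_similarity(emb_a, emb_b, dim=0).item()

# The reported finding: averaged over many pairs, similarity for
# near-homophone pairs exceeds similarity for synonym pairs at every layer.
```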
The authors also question the adequacy of widely used intent classification datasets, such as Fluent Speech Commands and Snips SmartLights, for measuring semantic abilities. They develop a simple baseline that uses only word identity information, and it outperforms S3M-based models on these datasets. This indicates that high scores on these datasets do not guarantee the presence of semantic content in S3M representations.
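The sketch below shows one plausible form of such a word-identity baseline: a bag-of-words classifier over utterance transcripts. The example data and the choice of CountVectorizer plus logistic regression are assumptions for illustration; the paper's exact featurization and classifier may differ.

```python
# Sketch: intent classification from word identity alone.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative (transcript, intent) pairs in the style of Fluent Speech Commands.
train_texts = ["turn on the lights", "turn off the lights", "increase the volume"]
train_intents = ["activate_lights", "deactivate_lights", "increase_volume"]

# Binary bag-of-words features encode only which words occur,
# discarding word order and any deeper semantics.
baseline = make_pipeline(
    CountVectorizer(binary=True),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_intents)

print(baseline.predict(["turn on the lights"]))  # -> ['activate_lights']
```

That such a baseline can match or beat S3M-based models suggests these benchmarks reward recognizing which words were said rather than understanding what they mean.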
The study concludes that S3Ms encode more phonetic than semantic content, and that existing semantic benchmarks for S3Ms may not fully capture their semantic capabilities. The authors recommend further research to better understand within-word differences in representations and to develop stronger baselines for semantic tasks.