The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models


12 Feb 2024 | Ayo Adedeji, Sarita Joshi, and Brendan Doohan
This study explores the potential of Large Language Models (LLMs) to enhance the accuracy of Automatic Speech Recognition (ASR) systems in medical transcription. Using the PriMock57 dataset, which includes a diverse range of primary care consultations, the research focuses on improving general Word Error Rate (WER), Medical Concept WER (MC-WER), and speaker diarization accuracy. The study also assesses the role of LLM post-processing in improving semantic textual similarity, preserving the contextual integrity of clinical dialogues.

The research employs two prompting techniques: zero-shot and Chain-of-Thought (CoT). Zero-shot prompting presents LLMs with a task and an instructional prompt but no task-specific examples, while CoT prompting supplies intermediate reasoning steps and few-shot examples to guide the model's reasoning. The study compares LLM post-processing across six ASR systems: Google Cloud's Medical Conversation model (GCMC), Chirp, Whisper 1, Amazon Transcribe Medical, Soniox, and Deepgram's Nova 2. Sketches of the evaluation metrics and the two prompting styles follow below.
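To make the two error metrics concrete, here is a minimal sketch of how they can be computed. WER uses the jiwer library; the MC-WER variant shown here, which filters both texts to a small medical-term vocabulary before scoring, is an illustrative assumption rather than the paper's exact procedure, and the term list is hypothetical.

```python
# Sketch of the two error metrics discussed above. General WER is computed
# with the jiwer library; the MC-WER variant (filtering both texts to a
# medical-term vocabulary before scoring) is an illustrative assumption.
import jiwer

reference = "patient reports chest pain and shortness of breath for two days"
hypothesis = "patient reports chess pain and shortness of breath for two days"

# General WER: (substitutions + deletions + insertions) / reference words.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")

# Hypothetical medical-concept vocabulary; a real pipeline would draw on a
# clinical ontology such as UMLS or SNOMED CT.
MEDICAL_TERMS = {"chest", "pain", "shortness", "breath"}

def medical_only(text: str) -> str:
    """Keep only tokens that belong to the medical-term vocabulary."""
    return " ".join(w for w in text.split() if w in MEDICAL_TERMS)

# MC-WER (sketch): WER restricted to medical concepts, so ASR errors on
# clinically relevant words ("chess" for "chest") dominate the score.
print(f"MC-WER: {jiwer.wer(medical_only(reference), medical_only(hypothesis)):.3f}")
```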
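And here is an illustrative sketch of the two prompting styles applied to diarization and correction. The prompt wording and the transcript snippet are assumptions for illustration, not the study's actual prompts; the resulting strings can be sent to any LLM client.

```python
# Illustrative construction of the two prompt styles compared in the study.
# Both operate on the same raw ASR output, which lacks speaker labels.
ASR_SNIPPET = (
    "so what brings you in today I've had a sore throat since Monday "
    "okay any fever no just the throat"
)

# Zero-shot: task instructions only, no worked examples.
zero_shot_prompt = f"""You are correcting a medical ASR transcript.
Split the text into speaker turns, label each as Doctor or Patient,
and fix transcription errors without changing the meaning.

Transcript: {ASR_SNIPPET}"""

# Chain-of-Thought: intermediate reasoning steps plus a few-shot example
# to guide the model before it labels the real transcript.
cot_prompt = f"""You are correcting a medical ASR transcript.
Reason step by step: (1) find turn boundaries from question/answer cues,
(2) decide which speaker asks clinical questions (Doctor) and which
describes symptoms (Patient), (3) correct misrecognized words.

Example:
Transcript: any allergies none that I know of
Reasoning: "any allergies" is a clinical question -> Doctor;
"none that I know of" answers it -> Patient.
Output:
Doctor: Any allergies?
Patient: None that I know of.

Now apply the same steps.
Transcript: {ASR_SNIPPET}"""
```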
Key findings include:

- LLMs, particularly through CoT prompting, improve diarization accuracy and achieve state-of-the-art performance.
- LLMs enhance the accuracy of capturing medical concepts and improve the overall semantic coherence of transcribed dialogues.
- The improvement in diarization accuracy extends to more accurate speaker identification and labeling.
- LLMs are particularly effective in managing linguistic diversity, although current ASR systems' adaptability to diverse accents and dialects remains a challenge.
- CoT prompting outperforms zero-shot prompting in diarization and correction accuracy, with LLMs showing more reliable and consistent performance.
- LLMs can significantly reduce medical concept errors, with Whisper 1 consistently exhibiting the lowest MC-WER.
- LLMs also enhance semantic similarity, improving the overall quality of transcriptions (see the sketch after this summary).

These findings highlight the dual role of LLMs in augmenting ASR outputs and independently excelling in transcription tasks, promising significant improvements in medical ASR systems and patient records in healthcare settings.
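As a closing illustration of the semantic-similarity evaluation mentioned in the findings, here is a minimal sketch assuming the sentence-transformers library; the choice of embedding model is an assumption, since this summary does not name the one used in the study.

```python
# Sketch of a semantic-textual-similarity check: embed the reference and the
# LLM-corrected transcript, then compare with cosine similarity. The model
# name below is an assumption for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Patient: I've had a sore throat since Monday, but no fever."
corrected = "Patient: I have had a sore throat since Monday, no fever though."

# Cosine similarity near 1.0 means the corrected transcript preserves the
# clinical meaning of the reference even where the wording differs.
embeddings = model.encode([reference, corrected])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.3f}")
```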