Vision-language foundation model for echocardiogram interpretation

May 2024 | Matthew Christensen, Milos Vukadinovic, Neal Yuan & David Ouyang
EchoCLIP is a vision–language foundation model for echocardiogram interpretation, trained on 1,032,975 cardiac ultrasound videos paired with corresponding expert report text. Without explicit training for individual tasks, it performs well on diverse benchmarks for cardiac image interpretation: it assesses cardiac function with a mean absolute error of 7.1% when predicting left ventricular ejection fraction and identifies implanted intracardiac devices with high accuracy, with performance comparable to human experts.

The model pairs a ConvNeXt-Base image encoder with a Byte-Pair Encoding text tokenizer. EchoCLIP-R, a long-context variant that compresses echocardiography reports into fewer tokens, enables robust image-to-text search with high retrieval accuracy, accurately identifies unique patients across multiple videos and studies, and detects clinical transitions such as heart transplantation and cardiac surgery, allowing clinically relevant changes to be tracked over time.

EchoCLIP's zero-shot performance is robust across datasets and clinical scenarios, suggesting potential for broader application in cardiovascular imaging. It represents a significant step toward applying foundation models to preliminary interpretation of echocardiographic findings, and its ability to learn from large datasets and generalize across tasks makes it a valuable tool for cardiac imaging analysis.
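The training recipe follows the standard CLIP approach: the ConvNeXt-Base image encoder and the BPE-tokenized text encoder are optimized so that embeddings of matching frame/report pairs align while mismatched pairs within a batch repel. Below is a minimal sketch of that symmetric contrastive objective; the temperature value and batching details are assumptions for illustration, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (frame, report) embeddings.

    image_emb, text_emb: (B, D) embeddings where row i of each tensor comes
    from the same echocardiogram study; off-diagonal pairs act as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```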
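The zero-shot ejection-fraction estimate works in the usual CLIP manner: embed the echo frame, embed a set of candidate report sentences that differ only in the stated EF value, and take the value whose text embedding is most similar. The sketch below assumes a generic CLIP-style model exposing `encode_image`/`encode_text` and an illustrative prompt template; EchoCLIP's exact report wording and interface may differ.

```python
import torch

@torch.no_grad()
def zero_shot_ef(model, tokenizer, frame: torch.Tensor) -> int:
    """Estimate LVEF (%) for one preprocessed echo frame of shape (1, C, H, W).

    Assumes `model` has encode_image/encode_text and `tokenizer` maps a list of
    strings to a token-id tensor; the prompt template is a hypothetical example.
    """
    candidates = list(range(5, 91))  # plausible EF values in percent
    prompts = [
        f"THE LEFT VENTRICULAR EJECTION FRACTION IS ESTIMATED TO BE {v}%."
        for v in candidates
    ]

    img_emb = model.encode_image(frame)              # (1, D)
    txt_emb = model.encode_text(tokenizer(prompts))  # (N, D)

    # Cosine similarity between the frame and every candidate sentence.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)          # (N,)

    return candidates[int(sims.argmax())]
```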
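Report retrieval and patient re-identification rest on the same embedding geometry: a study is mapped to a vector, and cosine similarity ranks candidate reports or prior studies. The helper below is a minimal sketch over precomputed embeddings, not EchoCLIP-R's actual pipeline; how a study embedding is pooled from its frames is an assumption here.

```python
import torch

def rank_by_similarity(query_emb: torch.Tensor,
                       candidate_embs: torch.Tensor) -> torch.Tensor:
    """Return candidate indices sorted by cosine similarity to the query.

    query_emb:      (D,)   study embedding (e.g. mean of its frame embeddings)
    candidate_embs: (N, D) embeddings of candidate reports or prior studies
    """
    q = query_emb / query_emb.norm()
    c = candidate_embs / candidate_embs.norm(dim=-1, keepdim=True)
    return torch.argsort(c @ q, descending=True)

# Cross-modal search: rank report embeddings against a video embedding.
# Re-identification: rank embeddings of other studies; a sharp drop in
# similarity to a patient's own prior study can flag a clinical transition
# such as transplantation or cardiac surgery.
```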