Machine learning for medical imaging: automated disease diagnosis and prognosis
National Library Of Medicine
Abstract
Deep learning, a class of machine learning algorithms, has shown impressive results in several of our recent studies this year. In addition to its application to natural language processing, we have also seen its success in our medical image analysis, such as processing chest X-rays, CT images, and various kinds of retinal images for autonomous disease diagnosis and prognosis. Together with our clinical collaborators on the NIH campus, we previously text-mined over 100,000 radiology reports, from which our algorithm generated weak training labels to enable the development of advanced deep learning methods for automatically reading and classifying chest X-ray images. This work resulted in the release to the scientific community of ChestX-ray14 (https://nihcc.app.box.com/v/ChestXray-NIHCC), one of the largest publicly available chest X-ray datasets, with over 10,000 downloads. We have also conducted research to assist in the screening of age-related macular degeneration (AMD), a leading cause of vision loss in Americans aged 60 and older. By leveraging cutting-edge deep learning techniques and repurposing big imaging data from a major AMD clinical trial, we previously developed a novel data-driven approach (DeepSeeNet) for autonomous AMD diagnosis that exceeded the performance of human ophthalmologists (retinal specialists in this case). This result highlights the potential of deep learning systems to assist in early disease detection and enhance clinical decision-making. In 2025, we continued this line of research with an emphasis on systematic evaluations of AI workflows and external validation. Our 2025 study, published in JAMA Network Open, evaluated the integration of AI into clinical workflows for diagnosing AMD. With 24 clinicians from 12 institutions and nearly 1,000 retinal images, we compared manual diagnosis with AI-assisted diagnosis.
We found that AI support significantly improved diagnostic accuracy (F1 score rising from 37.7 to 45.5) and reduced average review time by about 10 seconds per case. While AI alone outperformed clinicians in some tasks, experts remained superior in detecting complex late-stage disease. An upgraded model, DeepSeeNet+, trained on additional data, showed stronger performance and generalizability across U.S. and external multiethnic cohorts. The findings highlight how AI can enhance accuracy and efficiency in clinical eye disease diagnosis, while stressing the need for external validation, trust, and transparency in real-world deployment. In radiology, we evaluated the performance of multimodal large language models on 95 RSNA "Case of the Day" questions from the 2024 annual meeting. Among the tested models, OpenAI's o1 achieved the best results, correctly answering 59% of cases, outperforming Google's Gemini and other contemporaries. The findings highlight notable year-over-year progress in applying LLMs to radiology, particularly for case analysis and educational support. However, with accuracy still far from clinical reliability, we stress that these models should be seen as adjunct tools to support radiologists rather than as replacements for human expertise. We further introduced GPTRadScore, a novel automatic evaluation framework designed to assess how effectively multimodal large language models (MLLMs) interpret CT scan findings. Rather than relying on traditional language metrics like BLEU, METEOR, and ROUGE, GPTRadScore leverages GPT-4 to decompose model-generated descriptions into three clinically relevant components (body part, location, and lesion type) and assesses their accuracy against medically validated ground truths.
When applied to models such as GPT-4V, Gemini Pro Vision, LLaVA-Med, and RadFM (including a domain fine-tuned version), GPTRadScore showed strong alignment with human clinician evaluations, with Pearson correlation coefficients ranging from 0.75 to 0.91, substantially outperforming conventional metrics. Notably, fine-tuning RadFM on specialized CT data improved its precise descriptive abilities, boosting location accuracy from ~3.4% to 12.8%, body part identification from 29% to 53%, and lesion type accuracy from 9% to 30%. Overall, GPTRadScore provides a more clinically meaningful and scalable method for evaluating MLLMs in radiology, supporting better model development and validation.
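The two quantities reported for GPTRadScore, per-component accuracy and Pearson correlation with clinician ratings, can be illustrated with a minimal sketch. This is not the GPTRadScore implementation: in the framework itself GPT-4 performs the decomposition and judges each component, whereas the sketch below assumes the descriptions have already been decomposed and substitutes a case-insensitive exact string match as a stand-in for that judgment. The names `COMPONENTS`, `component_accuracy`, and `pearson`, and the dictionary keys, are hypothetical and chosen only for illustration.

```python
from math import sqrt

# Hypothetical keys for the three components named in the abstract.
COMPONENTS = ("body_part", "location", "lesion_type")


def component_accuracy(predictions, references):
    """Return, per component, the fraction of cases matching the reference.

    Assumption: a case-insensitive exact match stands in for the GPT-4
    judgment used by the actual framework.
    """
    if not predictions or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    hits = {c: 0 for c in COMPONENTS}
    for pred, ref in zip(predictions, references):
        for c in COMPONENTS:
            if pred[c].strip().lower() == ref[c].strip().lower():
                hits[c] += 1
    n = len(predictions)
    return {c: hits[c] / n for c in COMPONENTS}


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

With automatic scores and clinician ratings collected for the same set of cases, `pearson(auto_scores, clinician_scores)` gives the kind of agreement figure quoted above, and `component_accuracy` gives the per-component percentages.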