According to a study published in The Lancet Digital Health, artificial intelligence (AI) may be as effective as health professionals at diagnosing disease. The study represents the first systematic review and meta-analysis of the role of AI in diagnosing medical conditions.

A press release reports that the small number of high-quality studies of AI means that understanding the “true power” of AI is difficult. Therefore, to provide more evidence, Alastair Denniston (University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK) and colleagues conducted a systematic review and meta-analysis of all studies published between January 2012 and June 2019 that compared the performance of deep learning models with that of health professionals in detecting diseases from medical imaging. They also evaluated study design, reporting, and clinical value.

In total, 82 articles were included in the systematic review. Data were analysed for 69 articles that contained enough data to calculate test performance accurately. Pooled estimates from 25 articles that validated the results in an independent subset of images were included in the meta-analysis.

Analysis of data from the 14 studies that compared the performance of deep learning with that of health professionals on the same sample found that, at best, deep learning algorithms correctly detected disease in 87% of cases (sensitivity), compared with 86% for health professionals. The ability to correctly exclude patients who do not have disease was also similar: 93% specificity for deep learning algorithms versus 91% for health professionals.
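For readers less familiar with these measures, sensitivity and specificity are the standard way of summarising diagnostic accuracy. They are defined in terms of true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP):

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}
\]

In practical terms, the pooled figures above mean that, on these test sets, the algorithms flagged roughly 87 of every 100 cases where disease was truly present and correctly cleared roughly 93 of every 100 disease-free cases, closely matching the health professionals they were compared against.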

However, Denniston et al note several limitations in the methodology and reporting of the AI diagnostic studies included in the analysis. For example, only four studies provided health professionals with the additional clinical information that they would normally use to make a diagnosis in clinical practice. Additionally, few prospective studies were done in real clinical environments, and the authors say that determining diagnostic accuracy requires high-quality comparisons in patients, not just in datasets. Poor reporting was also common, with most studies not reporting missing data, which limits the conclusions that can be drawn.

Furthermore, only a few studies were of sufficient quality to be included in the analysis. The authors caution that the true diagnostic power of the AI technique known as deep learning (the use of algorithms, big data, and computing power to emulate human learning and intelligence) remains uncertain, because of the lack of studies that directly compare the performance of humans and machines, or that validate AI’s performance in real clinical environments.

Denniston comments: “We reviewed over 20,500 articles, but less than 1% of these were sufficiently robust in their design and reporting that independent reviewers had high confidence in their claims. What is more, only 25 studies validated the AI models externally (using medical images from a different population), and just 14 studies actually compared the performance of AI and health professionals using the same test sample. Within those handful of high-quality studies, we found that deep learning could indeed detect diseases ranging from cancers to eye diseases as accurately as health professionals. But it is important to note that AI did not substantially out-perform human diagnosis.”