According to researchers, machine-based algorithms were more accurate than human experts in assigning diagnoses to pigmented skin lesions through an analysis of dermatoscopic images. The findings from this study were published in Lancet Oncology.
Although automated devices based on datasets of dermatoscopic images of skin lesions have been approved by the US Food and Drug Administration (FDA) for the diagnosis of melanoma, their uptake by dermatologists has been limited, possibly because they are restricted to melanocytic lesions and require expert-based lesion preselection. A further limitation to the use of these devices may relate to the finding that approximately 50% of biopsied or excised pigmented lesions on heavily sun-damaged skin are nonmelanocytic.
A collection of more than 10,000 identified dermatoscopic images of all types of clinically significant pigmented lesions (ie, intraepithelial carcinoma including actinic keratoses and Bowen’s disease; basal cell carcinoma; benign keratinocytic lesions including solar lentigo, seborrheic keratosis, and lichen planus-like keratosis; dermatofibroma; melanoma; melanocytic nevus; and vascular lesions) was used in this study. This training set, in conjunction with a test set of 1511 unidentified images, was made available to machine-learning laboratories throughout the world in order to obtain estimates of the accuracy of machine-learning algorithms for the diagnosis of these skin lesions.
In addition, participation in a corresponding open, web-based, diagnostic study was offered to members of the International Dermoscopy Society. Respondents (ie, “human readers”) were asked to provide demographic characteristics, and to initially complete 4 screening tests (ie, involving selection of single responses to multiple choice questions following presentation of dermatoscopic images from the training set) in order to independently assess their diagnostic skill, and familiarize them with the diversity of the images. Study participants then completed at least 1 survey test, which utilized a subset of 30 images from the test set. One of the main aims of this study was to compare the diagnostic accuracy of the most advanced machine-based algorithms with that of the most experienced physicians.
“The primary task in our study was a multiclass problem with 7 disease categories, and not just the simple binary problem of melanoma versus nevi. Therefore, our diagnostic study could be considered closer to a real-life situation than other studies in this field,” the study authors noted.
Results for 139 algorithms obtained from 77 machine-learning laboratories were compared with results for 511 human readers who completed at least 1 final rating test. Of the human readers, 55.4%, 23.1%, and 16.2% were board-certified dermatologists, dermatology residents, and general practitioners, respectively. Almost 50% of human readers were aged between 31 and 40 years.
A central finding of the study was that, overall, machine-learning algorithms achieved a mean of 2.01 more correct answers than human readers (P <.0001).
Interestingly, human readers outperformed machine-based algorithms when the test questions involved more malignant lesions whereas the diagnostic accuracy of the machine-based algorithms was superior to human readers when the test questions were more heavily weighted with benign lesions.
When responses from human readers with more than 10 years of experience were compared with those from the top 3 machine-learning algorithms, the algorithms still outperformed the human readers, with a mean of 25.43 vs 18.78 correct answers, respectively (P <.0001).
In commenting on these results, the study authors noted that “although machine-learning algorithms outperformed human experts in nearly every aspect, higher accuracy in a diagnostic study with digital images does not necessarily mean better clinical performance or patient management. The metrics used in this study treated all diagnoses equally. The algorithms were trained to optimize the mean sensitivity across all classes, and did not consider that it is more detrimental to mistake a malignant for a benign lesion than vice versa.”
“In future, it is probable that automated classifiers will be used under human guidance, rather than alone. Hence, it might be more appropriate to test the accuracy of automated classifying algorithms in the hand of human readers rather than to test classifiers and humans alone,” they concluded.
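The metric the authors refer to, mean sensitivity across all classes, can be illustrated with a minimal Python sketch. The function name, the two-class toy labels, and the counts below are hypothetical and are not taken from the study, which used 7 disease categories. The sketch computes sensitivity (the fraction of lesions of a given class that were labeled correctly) separately for each class, then averages those values so that every class counts equally.

```python
from collections import defaultdict


def mean_sensitivity(true_labels, predicted_labels):
    """Return (mean of per-class sensitivity, per-class sensitivity dict).

    Each class contributes equally to the mean, regardless of how many
    images belong to it.
    """
    correct = defaultdict(int)  # correctly labeled images per true class
    total = defaultdict(int)    # all images per true class
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    per_class = {label: correct[label] / total[label] for label in total}
    return sum(per_class.values()) / len(per_class), per_class


# Hypothetical toy data: 8 nevi and 2 melanomas; one melanoma is missed.
truth = ["nevus"] * 8 + ["melanoma"] * 2
guess = ["nevus"] * 8 + ["nevus", "melanoma"]

score, per_class = mean_sensitivity(truth, guess)
accuracy = sum(t == g for t, g in zip(truth, guess)) / len(truth)

print(f"plain accuracy:   {accuracy:.2f}")
print(f"mean sensitivity: {score:.2f}")
print(f"per class:        {per_class}")
```

In this toy case, the single missed melanoma lowers mean sensitivity far more than it lowers plain accuracy, because the rare class carries the same weight as the common one. The metric still treats a missed melanoma and a missed nevus symmetrically, which is the limitation the authors describe.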
Tschandl P, Codella N, Akay BN, et al. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. Lancet Oncol. doi: 10.1016/S1470-2045(19)30333-X