A large, publicly available data set of digital breast tomosynthesis (DBT) images could facilitate the development of artificial intelligence (AI) algorithms for breast cancer detection, according to researchers.
The data set could prove useful for training and validation of AI models, the researchers wrote in JAMA Network Open.1
However, a related editorial questioned the quality of the data set and suggested that it will not enable the development of strong models.2
To create the data set, researchers curated and annotated 22,032 reconstructed DBT volumes that belonged to 5610 studies from 5060 patients.
The studies were divided into 4 groups:
- 5129 (91.4%) studies with no abnormal findings
- 280 (5.0%) actionable studies that required additional imaging but no biopsy
- 112 (2.0%) studies with benign masses or architectural distortions that were biopsied based on DBT examination
- 89 (1.6%) cancer studies involving at least 1 cancerous mass or architectural distortion that was biopsied based on DBT examination
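The group percentages above follow directly from the reported counts. As a quick arithmetic check, the short Python sketch below recomputes them; the dictionary keys are shorthand labels chosen for illustration, not names used by the researchers.

```python
# Study counts per group, as reported in the article.
counts = {
    "normal": 5129,      # no abnormal findings
    "actionable": 280,   # additional imaging, no biopsy
    "benign": 112,       # biopsied, benign
    "cancer": 89,        # biopsied, cancer
}

total = sum(counts.values())  # should match the 5610 studies reported

# Share of each group, rounded to one decimal place as in the article.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} studies ({pct}%)")
```

The counts sum to 5610, and cancer studies make up fewer than 2 in 100, so the data set is heavily weighted toward normal studies.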
To develop and test a deep learning model using these data, the researchers randomly split the data set into a training set (4838 studies, 4362 patients), a validation set (312 studies, 280 patients), and a test set (460 studies, 418 patients). No patient appeared in more than 1 subset.
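One common way to guarantee that no patient appears in more than one subset is to randomize at the patient level and let each study follow its patient. The Python sketch below illustrates that idea only; the function name `split_by_patient`, the 80/10/10 fractions, and the seed are assumptions for illustration and do not reproduce the researchers' actual procedure or proportions.

```python
import random
from collections import defaultdict


def split_by_patient(study_to_patient, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign studies to subsets so that no patient spans two subsets.

    study_to_patient maps a study ID to its patient ID (hypothetical IDs).
    fractions gives the illustrative train/validation/test patient shares.
    """
    # Shuffle the unique patients reproducibly.
    patients = sorted(set(study_to_patient.values()))
    random.Random(seed).shuffle(patients)

    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))

    # Each patient gets exactly one subset label.
    subset_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            subset_of[patient] = "train"
        elif i < n_train + n_val:
            subset_of[patient] = "validation"
        else:
            subset_of[patient] = "test"

    # Every study inherits the subset of its patient.
    splits = defaultdict(list)
    for study, patient in study_to_patient.items():
        splits[subset_of[patient]].append(study)
    return dict(splits), subset_of
```

Because the label is attached to the patient rather than the study, a patient with several studies contributes all of them to the same subset, which avoids leakage between training and evaluation.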
In testing their model, the researchers used a breast-based free-response receiver operating characteristic curve, which “shows the sensitivity of the model in relation to the number of false-positive predictions placed in slice images, volumes, or cases.”
The sensitivity at 2 false positives per DBT volume was 67% (95% CI, 53%-80%) for test cases with cancer and 65% (95% CI, 56%-74%) for all cases in the test set.
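To make the metric concrete, the Python sketch below shows one simplified way to read a single operating point off a free-response curve: predictions are ranked by confidence, and sensitivity is the share of true lesions found before the false-positive budget (here, 2 per volume) is exceeded. It is a lesion-level simplification, not the breast-based procedure the researchers used; the function name, the input format, and the handling of tied scores are all assumptions for illustration.

```python
def sensitivity_at_fp_rate(predictions, n_lesions, n_volumes, fp_per_volume=2.0):
    """Sensitivity at a fixed average number of false positives per volume.

    predictions: iterable of (score, lesion_id) pairs, where lesion_id is the
    ID of the true lesion a prediction hits, or None for a false positive.
    n_lesions: total number of true lesions in the evaluated volumes.
    n_volumes: total number of evaluated volumes.
    """
    max_fp = fp_per_volume * n_volumes  # total false-positive budget
    detected = set()
    false_positives = 0

    # Walk predictions from most to least confident, as if lowering a threshold.
    for score, lesion_id in sorted(predictions, key=lambda p: p[0], reverse=True):
        if lesion_id is None:
            false_positives += 1
            if false_positives > max_fp:
                break  # budget exceeded; stop lowering the threshold
        else:
            detected.add(lesion_id)  # repeated hits on a lesion count once

    return len(detected) / n_lesions


# Toy example with made-up scores: 1 volume, 3 true lesions.
toy = [(0.9, "a"), (0.8, None), (0.7, None), (0.6, "b"), (0.5, None), (0.4, "c")]
at_two_fp = sensitivity_at_fp_rate(toy, n_lesions=3, n_volumes=1, fp_per_volume=2)
```

In the toy example, lesions "a" and "b" are found before the third false positive, so sensitivity at 2 false positives per volume is 2 of 3.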
The researchers concluded that their data set and deep learning model “could significantly advance the research on machine learning tools in breast cancer screening and medical imaging in general.” To that end, the group has made the data publicly available at the Cancer Imaging Archive.3
The authors of the related editorial applauded the researchers for making the data set public but highlighted limitations with the data, including a lack of follow-up, a “significant number of exclusions,” and the lack of cases of suspicious calcifications.2
“[D]ata sets made public must be of better quality and representative of a screening population to be truly useful,” the editorialists wrote. “Future models will otherwise risk being trained and tested on the wrong ground truth.”
Disclosures: Some study authors and editorialists declared conflicts of interest. Please see the original references for a full list of disclosures.
1. Buda M, Saha A, Walsh R, et al. A data set and deep learning algorithm for the detection of masses and architectural distortions in digital breast tomosynthesis images. JAMA Netw Open. Published online August 16, 2021. doi:10.1001/jamanetworkopen.2021.19100
2. Elmore JG, Lee CI. Data quality, data sharing, and moving artificial intelligence forward. JAMA Netw Open. Published online August 16, 2021. doi:10.1001/jamanetworkopen.2021.19345
3. Breast cancer screening—digital breast tomosynthesis (BCS-DBT). Cancer Imaging Archive. Published June 9, 2021. https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=64685580