Data and Methods
Data sets. We extracted patient–disease pairs from adverse event reports for comorbidity mining. The reports contain records of 3,354,043 patients, of whom 2,213,399 (66%) have age information and 3,153,795 (94%) have gender information. Figure 1(a,b) shows the distributions of age and gender. Unlike Medicare claims, which only cover patients aged 65 years or older, the adverse event reports include patients with reported ages from one day to hundreds of years.
With both disease and demographic data for millions of patients, we were able to study the potential effects of age and gender on disease comorbidity patterns. For comorbidity extraction, we stratified patients into five groups by age (Fig. 1a) and two groups by gender (Fig. 1b).
The adverse event reporting system represents patient diseases by the indications of drugs that patients take. These indication terms include not only disorders, but also treatment procedures, such as surgery; common symptoms, such as pain; and ill-defined events, such as un-evaluable events.
We mapped the indication terms to the concept unique identifiers (CUIs) in the Unified Medical Language System (UMLS), combined the synonyms into unique concepts, and extracted the concepts with semantic types of human disorders.
From the 10,122 indication terms, we extracted 8,224 disorder concepts, including terms of the 11 semantic types listed in Figure 1(c). Among the disorder concepts, we found 1,138 different cancers, which have the semantic type of neoplastic process (T191).
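The mapping and filtering step can be illustrated with a small sketch. It assumes a lookup table from indication term to concept identifier and semantic type is already available; the table, the placeholder concept ids, and the function name below are hypothetical, and only three disorder semantic types are shown rather than the 11 listed in Figure 1(c).

```python
# Hypothetical lookup: indication term -> (concept id, semantic type id).
# The concept ids are placeholders, not real UMLS CUIs.
TERM_TO_CONCEPT = {
    "breast cancer": ("CUI_A", "T191"),
    "breast carcinoma": ("CUI_A", "T191"),  # synonym of the same concept
    "depression": ("CUI_B", "T048"),
    "surgery": ("CUI_C", "T061"),           # treatment procedure, not a disorder
}

# Assumed subset of disorder semantic types, for illustration only.
DISORDER_TYPES = {"T047", "T048", "T191"}
CANCER_TYPE = "T191"  # neoplastic process


def extract_disorder_concepts(terms):
    """Map terms to concepts, merge synonyms, and keep disorder concepts only."""
    concepts = {}
    for term in terms:
        entry = TERM_TO_CONCEPT.get(term.lower())
        if entry is None:
            continue  # term could not be mapped
        concept_id, semantic_type = entry
        if semantic_type in DISORDER_TYPES:
            concepts[concept_id] = semantic_type  # synonyms collapse to one key
    return concepts


terms = ["Breast cancer", "breast carcinoma", "Depression", "Surgery", "unmapped term"]
concepts = extract_disorder_concepts(terms)
cancers = {cid for cid, sem_type in concepts.items() if sem_type == CANCER_TYPE}
```

In this toy input, the two breast cancer synonyms collapse into one concept, and the procedure term and the unmapped term are dropped.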
Extract stratified cancer comorbidities. Using the patient–disease data in each stratified group, we mined cancer comorbidities in three steps (Fig. 2). First, we applied an association rule mining algorithm to the patient–disease pairs and mined strong co-occurrence patterns among all possible disease combinations. Second, we constructed a comorbidity network from the resulting patterns.
Finally, to extract comorbidities for cancers, we started a random walk on the network from a set of cancer nodes of interest and ranked the non-cancer diseases by their probabilities of being reached by the walk. After repeating the three steps for each patient group, we traced the changes in cancer comorbidities across age and gender groups. The following subsections describe each step in detail.
Mine comorbidity patterns. Most previous studies used relative risk and ϕ-correlation to mine comorbidity patterns. However, both measures are intrinsically biased toward rare diseases and consider only pairwise relationships. We applied an association rule mining approach, which flexibly detects strong co-occurrence relationships not only between disease pairs but also among multiple diseases.
Because of the large number of patients and diseases in the data, we implemented the association rule mining with the frequent pattern growth algorithm [25], based on the Weka Java package [26], to efficiently search for possible association patterns. This algorithm has been successfully applied in the biomedical domain to extract drug adverse events [27].
The result of the algorithm is a list of patterns between two sets of diseases, represented in the form X⇒Y. For example, [anxiety, amnesia]⇒[depression] indicates that when patients have anxiety and amnesia, they are also likely to have depression.
Although each pattern is written with an arrow, the arrow does not imply causation between diseases; it only represents co-occurrence. To avoid confusion, we ignored the directions of the patterns and considered all diseases in sets X and Y to be associated.
The frequent pattern growth algorithm requires a few parameters. We set the minimum support to 5, which means that at least five patients must have all the diseases in a pattern at the same time; we limited the maximum number of diseases in each pattern to three; and we chose confidence to measure and rank the patterns.
The confidence score of pattern X⇒Y, defined in (1), estimates the probability that Y appears given X. The numerator is the number of patients who have all the diseases in both X and Y, and the denominator is the number of patients who have all the diseases in X. We extracted all patterns with confidence over 10%.
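The thresholds and the confidence score can be illustrated with a minimal sketch. Following the description above, it takes confidence(X⇒Y) = n(X ∪ Y) / n(X), where n(·) is the number of patients who have every disease in the set. The sketch enumerates disease combinations by brute force rather than using the frequent pattern growth algorithm or Weka, so it shows what is computed, not how the study computed it. The function names and the toy patient records are made up for illustration.

```python
from collections import Counter
from itertools import combinations

MIN_SUPPORT = 5        # at least five patients share every disease in a pattern
MAX_PATTERN_SIZE = 3   # at most three diseases per pattern
MIN_CONFIDENCE = 0.10  # keep patterns with confidence over 10%


def count_itemsets(patient_diseases):
    """Count the patients who have every disease in each itemset of size 1 to 3."""
    counts = Counter()
    for diseases in patient_diseases.values():
        items = sorted(set(diseases))
        for size in range(1, MAX_PATTERN_SIZE + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return counts


def mine_patterns(patient_diseases):
    """Return (X, Y, confidence) for every rule X => Y that passes both thresholds."""
    counts = count_itemsets(patient_diseases)
    patterns = []
    for itemset, n_xy in counts.items():
        if len(itemset) < 2 or n_xy < MIN_SUPPORT:
            continue
        # Split the itemset into every non-empty antecedent X and consequent Y.
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                x = frozenset(combo)
                confidence = n_xy / counts[x]
                if confidence > MIN_CONFIDENCE:
                    patterns.append((x, itemset - x, confidence))
    return sorted(patterns, key=lambda p: -p[2])


# Toy records: five patients with three diseases, five with anxiety only.
patients = {i: ["anxiety", "amnesia", "depression"] for i in range(5)}
patients.update({i: ["anxiety"] for i in range(5, 10)})
patterns = mine_patterns(patients)
```

Brute-force enumeration grows quickly with the number of diseases per patient, which is the cost that a frequent pattern growth implementation is designed to avoid.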
Construct comorbidity network. For each stratified patient group, we constructed a disease comorbidity network to model the results of association rule mining. Given a pattern X⇒Y, we collected all diseases in sets X and Y, assumed that they are associated with each other, and connected each pair of these diseases with an edge.
After transforming all pattern rules into connected disease nodes, we constructed an unweighted and undirected comorbidity network. The network offers a global view of comorbidity relationships among all diseases.
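A minimal sketch of this construction follows. It assumes that "each pair" means every pair of diseases in the union of X and Y, which is one reading of the description; the function name and the toy patterns are illustrative only.

```python
from itertools import combinations


def build_network(patterns):
    """Return an undirected, unweighted adjacency map from (X, Y) patterns."""
    adjacency = {}
    for x, y in patterns:
        diseases = sorted(set(x) | set(y))  # direction of the rule is ignored
        for a, b in combinations(diseases, 2):
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency


# Toy patterns, not taken from the study's results.
network = build_network([
    ({"anxiety", "amnesia"}, {"depression"}),
    ({"depression"}, {"insomnia"}),
])
```

Storing neighbors in sets keeps the network unweighted: a pair of diseases that appears in several patterns still yields a single edge.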
Rank cancer comorbidities. Given any set of cancer nodes of interest as the “seeds,” we applied the random walk with restart algorithm to estimate the relevance score of each node to the seeds. The random walk algorithm takes network structure into account without overemphasizing the connections through highly connected nodes.
Let p_0 be the vector of initial scores for the comorbidity candidates, and let p_k be the vector of relevance scores of all nodes at step k. The algorithm iteratively updates p_k by (2), where M is the column-normalized adjacency matrix of the comorbidity network and γ is the probability that the random walker restarts from the seed nodes at each step.
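Equation (2) is not reproduced in this text. The standard random walk with restart update, written with the symbols defined above, is the likely form and is given here as a reconstruction rather than a quotation:

```latex
p_{k+1} = (1 - \gamma)\, M\, p_k + \gamma\, p_0 \tag{2}
```

Under the usual convention, which is assumed here, p_0 assigns equal probability to the seed nodes and zero to all other nodes, and the iteration stops when the change between p_k and p_{k+1} falls below a small tolerance.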
The random walk algorithm generates a list of cancer comorbidities ranked by their relevance scores for each stratified patient group.
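The ranking step can be sketched as follows, assuming the standard random walk with restart update p_{k+1} = (1 − γ) M p_k + γ p_0 and an adjacency map from disease to neighbor set. The restart probability, the tolerance, the function name, and the toy network are assumptions made for illustration; the source does not state the value of γ that was used.

```python
def random_walk_with_restart(adjacency, seeds, gamma=0.3, tol=1e-10, max_iter=1000):
    """Iterate p_{k+1} = (1 - gamma) * M p_k + gamma * p_0 until the scores settle."""
    nodes = sorted(adjacency)
    # p_0 puts equal probability on the seeds and zero elsewhere.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: gamma * p0[n] for n in nodes}
        for n in nodes:
            if not adjacency[n]:
                continue
            # Column-normalized M: node n splits its score evenly among neighbors.
            share = (1.0 - gamma) * p[n] / len(adjacency[n])
            for m in adjacency[n]:
                nxt[m] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Toy network, not taken from the study's results.
network = {
    "lung cancer": {"copd", "anxiety"},
    "copd": {"lung cancer", "asthma"},
    "anxiety": {"lung cancer"},
    "asthma": {"copd"},
}
seeds = {"lung cancer"}
scores = random_walk_with_restart(network, seeds)
ranked = sorted((d for d in network if d not in seeds), key=lambda d: -scores[d])
```

In this toy network, the two direct neighbors of the seed rank above the disease that is two steps away, and the scores sum to one because every node has at least one neighbor.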