library(dplyr)
library(ggplot2)
library(magrittr)
library(DT)

options(stringsAsFactors = FALSE)

write.delim <- function(x, file, sep='\t', quote = FALSE, row.names=FALSE, na = '', ...) {
  write.table(x = x, file = file, sep=sep, quote=quote, row.names=row.names, na=na, ...)
}

Here, we parse the MEDI resource (1), which can be downloaded here. The resource is described (unless noted, blockquotes refer to material from the MEDI publication):

We processed four public medication resources, RxNorm, Side Effect Resource (SIDER) 2, MedlinePlus, and Wikipedia, to create MEDI. We applied natural language processing and ontology relationships to extract indications for prescribable, single-ingredient medication concepts and all ingredient concepts as defined by RxNorm. Indications were coded as Unified Medical Language System (UMLS) concepts and International Classification of Diseases, 9th edition (ICD9) codes. A total of 689 extracted indications were randomly selected for manual review for accuracy using dual-physician review. We identified a subset of medication–indication pairs that optimizes recall while maintaining high precision.

High precision subset (HPS)

The MEDI high-precision subset (MEDI-HPS) includes indications found within either RxNorm or at least two of the three other resources. MEDI-HPS contains 13,304 unique indication pairs regarding 2,136 medications. The mean±SD number of indications for each medication in MEDI-HPS is 6.22±6.09. The estimated precision of MEDI-HPS is 92%.

# Read the HPS dataset
hps.df <- file.path('download', 'MEDI_01212013_HPS_0.csv') %>%
  read.csv(blank.lines.skip = TRUE, colClasses = 'character') %>%
  dplyr::transmute(
    rxnorm_id = RXCUI_IN,
    drug_name = DRUG_DESC,
    disease_icd9 = ICD9,
    disease_name = INDICATION_DESCRIPTION,
    on_label = as.integer(POSSIBLE_LABEL_USE)
    )

# Calculate HPS counts
hps_count.df <- hps.df %>%
  dplyr::distinct(rxnorm_id, disease_icd9) %>%
  dplyr::summarize(
    resource = 'hps',
    medications = n_distinct(rxnorm_id),
    diseases = n_distinct(disease_icd9),
    indications = n())

# Display the HPS
hps.df %>% DT::datatable()

All Indications

MEDI contains 3,112 medications and 63,343 medication–indication pairs. Wikipedia was the largest resource, with 2608 medications and 34,911 pairs. For each resource, estimated precision and recall, respectively, were 94% and 20% for RxNorm, 75% and 33% for MedlinePlus, 67% and 31% for SIDER 2, and 56% and 51% for Wikipedia.

# Read the general dataset
umls.df <- file.path('download', 'MEDI_01212013_UMLS.csv.gz') %>%
  read.csv(blank.lines.skip = TRUE, colClasses = 'character') %>%
  dplyr::transmute(
    rxnorm_id = RXCUI_IN,
    drug_name = DRUG_DESC,
    disease_cui = UMLS_CUI,
    disease_icd9 = ICD9,
    disease_name = INDICATION_DESCRIPTION,
    resource = MENTIONEDBYRESOURCES,
    hps = as.integer(HIGHPRECISIONSUBSET),
    on_label = as.integer(POSSIBLE_LABEL_USE)
    )

# Display 1000 rows of the general dataset
umls.df %>% dplyr::sample_n(1000) %>% DT::datatable()

Resource-specific counts

Here we try to calculated the unique number of medications, diseases, and indications for each resource:

# Create a count table for each resource, hps subset, and general dataset.
count.df <- 
  # unique medications
  umls.df %>%
    dplyr::distinct(rxnorm_id, disease_icd9, resource) %>%
    dplyr::group_by(resource) %>%
    dplyr::summarize(medications = n_distinct(rxnorm_id)) %>%
  dplyr::left_join(
    # unique diseases
    y = umls.df %>%
      dplyr::distinct(rxnorm_id, disease_icd9, resource) %>%
      dplyr::group_by(resource) %>%
      dplyr::summarize(diseases = n_distinct(disease_icd9))) %>%
  dplyr::left_join(
    # unique indications
    y = umls.df %>%
      dplyr::distinct(rxnorm_id, disease_icd9, resource) %>%
      dplyr::group_by(resource) %>%
      dplyr::summarize(indications = n())) %>%
  dplyr::bind_rows(
    # union of all resources
    umls.df %>%
      dplyr::distinct(rxnorm_id, disease_icd9, resource) %>%
      dplyr::summarize(
        resource = 'all',
        medications = n_distinct(rxnorm_id),
        diseases = n_distinct(disease_icd9),
        indications = n()), 
    hps_count.df)

# Display the count table
count.df %>% knitr::kable()

resource	medications	diseases	indications
1	3091	2985	53615
2	1648	1075	6279
3	984	551	2497
4	447	222	952
all	3112	3009	63343
hps	2139	1345	13379

We compare our counts to the corresponding table from the manuscript:

Table 2: Number of unique medications, ICD9 codes, and indication pairs extracted from each resource

Resource Medications (% of total) ICD9 codes (% of total) Indication pairs (% of total)

RxNorm 1726 (56) 999 (33) 8040 (13)

SIDER 2 1554 (50) 1703 (57) 17702 (28)

MedlinePlus 1629 (52) 869 (29) 16581(26)

Wikipedia 2608 (84) 2624 (87) 34911 (55)

Union of all resources 3112 3009 63343

Resource	Medications (% of total)	ICD9 codes (% of total)	Indication pairs (% of total)
RxNorm	1726 (56)	999 (33)	8040 (13)
SIDER 2	1554 (50)	1703 (57)	17702 (28)
MedlinePlus	1629 (52)	869 (29)	16581(26)
Wikipedia	2608 (84)	2624 (87)	34911 (55)
Union of all resources	3112	3009	63343

Our analysis finds different resource-specific counts, but agrees with the combined counts. The high precision subset counts are very similar although different:

Thus, to optimize recall while maintaining reasonable precision, we defined the MEDI high-precision subset (MEDI-HPS) as the indications found within either at least two of the four resources or RxNorm. The current version of MEDI-HPS contains 13 304 unique indication pairs for 2136 medications. The mean number of indications for each medication is 6.22±6.09. The mode for each medication is 2, while the median is 4.

Indication-resource venn diagram

Next, we try to recreate the righthand panel of manuscript Figure 2, which shows the number of indications belonging to each combination of resources.

Figure 2: Weighted Venn diagram of the distribution of 3112 medications (left) and 63 343 indication pairs (right) within the four resources. Each border color represents a resource. Different colored areas represent medications–indications that were found within different combinations of resources. The area sizes surrounded by border color(s) are proportional to the number of medications–indications that were found within the corresponding resource(s).

Below, we construct a venn diagram of indication-resources. We find that all indications are only present in a single resource. This is either an error in our parsing or interpretation of MEDI_01212013_UMLS.csv.gz, or in the file itself.

# collapse umls.df so each row is a single indication (rxnorm_id-disease_icd9 pair)
medi.df <- umls.df %>%
  dplyr::group_by(rxnorm_id, disease_icd9) %>%
  dplyr::summarize(
    drug_name = drug_name[1],
    disease_cui = disease_cui[1],
    disease_name = disease_name[1],
    resource_1 = as.integer(any(resource == '1')),
    resource_2 = as.integer(any(resource == '2')),
    resource_3 = as.integer(any(resource == '3')),
    resource_4 = as.integer(any(resource == '4')),
    hps = as.integer(any(hps == 1)),
    on_label = as.integer(any(on_label == 1))
    ) %>%
  dplyr::ungroup()

# medi.df %$% xtabs(~ resource_1 + resource_2 + resource_3 + resource_4)

# venn diagram with the number of indications from each resource combination.
indication.venn <- medi.df %>%
  dplyr::select(resource_1:resource_4) %>%
  gplots::venn()

# table with the number of indications from each resource combination.
knitr::kable(indication.venn[, ])

	num	resource_1	resource_2	resource_3	resource_4
0000	0	0	0	0	0
0001	952	0	0	0	1
0010	2497	0	0	1	0
0011	0	0	0	1	1
0100	6279	0	1	0	0
0101	0	0	1	0	1
0110	0	0	1	1	0
0111	0	0	1	1	1
1000	53615	1	0	0	0
1001	0	1	0	0	1
1010	0	1	0	1	0
1011	0	1	0	1	1
1100	0	1	1	0	0
1101	0	1	1	0	1
1110	0	1	1	1	0
1111	0	1	1	1	1

Prevalence

A seperate analysis identified indication prevalence for the MEDI indications (2). MEDI appeared to have good medication coverage:

MEDI covered 97.3% of medications recorded in medical records. The “high precision subset” of MEDI covered 93.8% of recorded medications.

# Read the prevalence information for each indication
prev.df <- file.path('download', 'MEDI_wPrevalence_Published.csv.gz') %>%
  read.csv(blank.lines.skip = TRUE, colClasses = 'character') %>%
  dplyr::transmute(
    rxnorm_id = RXCUI_IN,
    drug_name = DRUG_DESC,
    disease_icd9 = ICD9,
    disease_name = ICD9_STR,
    hps = as.integer(HSP),
    prevalance = as.numeric(PREVALENCE)
    )

# collapse by indications (disease-drug pairs) and display 1000 rows
prev.df %>%
  dplyr::filter(! is.na(prevalance))  %>%
  dplyr::group_by(disease_icd9, rxnorm_id)  %>%
  dplyr::summarize(prevalance = mean(prevalance)) %>% 
  dplyr::sample_n(1000) %>% DT::datatable()

References

1. Wei W-Q, Cronin RM, Xu H, Lasko TA, Bastarache L, Denny JC (2013) Development and evaluation of an ensemble resource linking medications to their indications. Journal of the American Medical Informatics Association. doi:10.1136/amiajnl-2012-001431

2. Wei W-Q, Mosley JD, Bastarache L, Denny JC 2013. Validation and enhancement of a Computable Medication Indication Resource (MEDI) using a large practice-based dataset. In AMIA Annual Symposium Proceedings. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900157/

To the extent possible under law, Daniel Himmelstein has waived all copyright and related or neighboring rights to Indications. This work is published from: United States.

Parsing MEDI

Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini

February 16, 2015