Find set of rows in row-specific range with restriction at different levels

Issue

I have a dataset where each row is identified by a hospitalization id – physician id. Each row also contains information on the dates of admission and discharge and the hospital where it took place. A hospitalization may involve multiple physicians. A physician may work in multiple hospitals.

I have another dataset with information on each physician’s specialties (e.g., clinician, cardiologist). A physician may have multiple specialties.

I would like to know, for each hospitalization-physician row, the id of all other hospitalizations concluded in the 30 days prior to the start of that hospitalization which were performed in the same hospital by other physicians in the same specialty.

Partly using the solution posted in R (dplyr): find all rows in row-specific range *WITH RESTRICTION*, I managed to write a code that finds all hospitalizations performed by other physicians in the same hospital in the 30-day period prior to the start of the given hospitalization. For each row, I first found the list of all hospitalizations in the given hospital during the 30-day time period. Then, I found a list that only includes the hospitalizations where the ego physician was involved. Finally, I selected the elements in the first list which are not in the second list.

I would like to adjust the code to find the hospitalizations by other physicians who share at least one specialty with the ego physician. Ideally, I would like to change the first step of the code above to find the list of all hospitalizations within the ego physician’s set of specialty. Then I can use the rest of the code as it is to subtract from this list the hospitalizations involving the ego physician. The main difficulty here is the fact that a physician may have multiple specialties – otherwise it would just be a matter of including another variable in the filter function.

Below is the code I have now – it does not take into consideration the specialty of the ego physician.

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5, 
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A", 
                                 "A", "A", "A", 
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1, 
                                  2, 2, 2,
                                  3, 3, 
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

df2 <- df %>%
  mutate(
    # Generates 30-day time interval before start of given hospitalization 
    date_range1 = date_start - 30,
    date_range2 = date_start - 1,
    # List of all hospitalizations in given hospital, in time interval
    hospid_all = pmap(list(date_range1, date_range2, hospital_id),
                      function(x, y, z) filter(df,
                                               date_end >= x & date_end <= y,
                                               hospital_id == z)$hospitalization_id),
    hospid_all = lapply(hospid_all, unique),
    # List of ego's hospitalizations in given hospital, in time interval
    hospid_ego = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                      function(x, y, z, p) filter(df,
                                                  date_end >= x & date_end <= y,
                                                  hospital_id == z,
                                                  physician_id == p)$hospitalization_id),
    # List of peers' hospitalizations in given hospital, in time interval
    hospid_peer = future_map2(hospid_all, hospid_ego, ~ .x[!(.x %in% .y)])) %>%
  select(-starts_with('date_'), -hospid_all, -hospid_ego) %>% # only keep peers' list of hospitalization
  rename('ego'='physician_id')

df3 <- df2 %>%
  select(hospitalization_id, hospital_id, ego, hospid_peer) %>%
  unnest(hospid_peer, keep_empty = TRUE)

df4 <- df3 %>%
  left_join(select(df, hospitalization_id, physician_id), 
            by=c('hospid_peer'='hospitalization_id')) %>%
  rename(alter = physician_id)

The specialty of each physician is informed in this other df. In this example, Physician 2 shares specialty with both Physician 1 and Physician 3, but Physician 1 and Physician 3 do not have any specialty in common.

physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                      specialty_code = c(100, 100, 200, 200))

Solution

You can create two helper functions, other_mds, and f. The first of these takes a physician id, and returns the physician ids of those with matching specialities. The second takes the hospital id, the physician id, and the start date (i.e. for a particular row in df, and return the list of hospitalizations that concluded within the prior 30 days, were in the same hospital, and were conducted by a physician with a matching speciality.

other_mds <- function(pid) {
  physician_spec[
    physician_id!=pid & specialty_code %in% physician_spec[physician_id==pid, specialty_code],
    physician_id]
}

f <- function(hid, pid, s) {
  other_phys = other_mds(pid)
  exclude_hosps = df[physician_id == pid, unique(hospitalization_id)]
  df[hospital_id == hid & 
       physician_id %in% other_phys &
       s>date_end &
       (s-date_end)<30 &
       !hospitalization_id %in% exclude_hosps,
     paste0(hospitalization_id, collapse=",")]
}

Now, we just apply the function f to each row

library(data.table)
setDT(df)
setDT(physician_spec)
df[, matches:=f(hospital_id, physician_id,date_start), 1:nrow(df)]

Output:

    hospitalization_id hospital_id physician_id date_start   date_end matches
                 <num>      <char>        <num>     <Date>     <Date>  <char>
 1:                  1           A            1 2000-01-01 2000-01-03        
 2:                  2           A            1 2000-01-12 2000-01-18        
 3:                  3           A            1 2000-01-20 2000-01-22        
 4:                  1           A            2 2000-01-01 2000-01-03        
 5:                  2           A            2 2000-01-12 2000-01-18        
 6:                  3           A            2 2000-01-20 2000-01-22       4
 7:                  4           A            3 2000-01-12 2000-01-18       1
 8:                  5           A            3 2000-01-20 2000-01-22     1,2
 9:                  6           B            2 2000-02-10 2000-02-11        
10:                  7           B            2 2000-02-11 2000-02-14        
11:                  8           B            2 2000-02-12 2000-02-17        

Update – returning a vector of matches, and then merge:

  1. Change f so that it returns a vector
f <- function(hid, pid, s) {
  other_phys = other_mds(pid)
  exclude_hosps = df[physician_id == pid, unique(hospitalization_id)]
  df[hospital_id == hid & 
       physician_id %in% other_phys &
       s>date_end &
       (s-date_end)<30 &
       !hospitalization_id %in% exclude_hosps]$hospitalization_id
}
  1. Now, when we run the function, we do this by hospitalization_id and physician_id, so that it returns a three column data.table (columns are the by columns and new column called match. Then merge this on the original df
df[, .(match = f(hospital_id, physician_id,date_start)), .(hospitalization_id, physician_id)][
  df, 
  on=.(hospitalization_id,physician_id)
]

Output:

    hospitalization_id physician_id match hospital_id date_start   date_end
                 <num>        <num> <num>      <char>     <Date>     <Date>
 1:                  1            1    NA           A 2000-01-01 2000-01-03
 2:                  2            1    NA           A 2000-01-12 2000-01-18
 3:                  3            1    NA           A 2000-01-20 2000-01-22
 4:                  1            2    NA           A 2000-01-01 2000-01-03
 5:                  2            2    NA           A 2000-01-12 2000-01-18
 6:                  3            2     4           A 2000-01-20 2000-01-22
 7:                  4            3     1           A 2000-01-12 2000-01-18
 8:                  5            3     1           A 2000-01-20 2000-01-22
 9:                  5            3     2           A 2000-01-20 2000-01-22
10:                  6            2    NA           B 2000-02-10 2000-02-11
11:                  7            2    NA           B 2000-02-11 2000-02-14
12:                  8            2    NA           B 2000-02-12 2000-02-17

Answered By – langtang

Answer Checked By – Senaida (AngularFixing Volunteer)

Leave a Reply

Your email address will not be published.