4

I am looking for an efficient way to compute the union and intersection of time intervals (start–stop format) by group (id), while keeping the covariates associated with each interval.

Each patient (id) has:

  • a set of post-intervention periods (A)
  • a set of drug treatment periods (B)

I want to compute, for each patient:

  • the intersection (A ∩ B): periods where both are active
  • the union (A ∪ B): all time segments covered by either A or B while keeping the relevant covariates from each dataset.

It seems that data.table::foverlaps() could help, but I can’t figure out how to use it for grouped interval manipulation.

Example:

library(data.table)

A <- data.table(
  start = as.Date(c("2022-10-16", "2023-06-16")),
  stop  = as.Date(c("2023-03-12", "2023-10-12")),
  covar_A = c("P-int 1", "P-int 2")
)

B <- data.table(
  start = as.Date(c("2022-10-16", "2023-08-16")),
  stop  = as.Date(c("2023-01-12", "2023-12-12")),
  covar_B = c("Drug 1", "Drug 2")
)

Expected output:

Union (A ∪ B)

Start Stop covar_A covar_B
2022-10-16 2023-01-12 P-int 1 Drug 1
2023-01-12 2023-03-12 P-int 1 NA
2023-03-12 2023-06-16 NA NA
2023-06-16 2023-08-16 P-int 2 NA
2023-08-16 2023-10-12 P-int 2 Drug 2
2023-10-12 2023-12-12 NA Drug 2

Intersection (A ∩ B)

Start Stop covar_A covar_B
2022-10-16 2023-01-12 P-int 1 Drug 1
2023-08-16 2023-10-12 P-int 2 Drug 2

Visualisation of the expected output:

Plot showing horizontal bars with dates on the x-axis and intervals on the y-axis

How can I compute these interval unions and intersections per patient (id) while keeping the covariate information from both datasets?

Ideally, the solution should scale to datasets with several thousand patients and tens of thousands of intervals.

5
  • 2
    For the intersection cases: is it correct to assume that you are looking for exactly the same time intervals from A and B? In that case, it would be a simple left_join with both start and stop as "by" variables. For the union case, can you please elaborate a bit more, e.g. 2023-01-12 does not appear as a start date in any of your data sets, yet you show it in your expected output. Commented Oct 26 at 14:59
  • Thanks for your reply! By intersection, I mean the overlapping parts of the intervals from A and B — not the exact matches of start and stop dates. By union, I mean all distinct time segments defined by any start or stop boundary from either A or B. Here left_join() didn't produce the expected result. In fact, the operation I’m looking for should be symmetric between A and B, just like set unions and intersections in mathematics wouldn't produce the expected result. Commented Oct 27 at 15:15
  • Another: you are referring to patients and respective ids. However, they are not part of your data. Can you please update? I assume, rows 1 of your data A and B is patient 1, row 2 refers to patient 2? Commented Oct 27 at 16:25
  • Perhaps, b = sort(unique(c(A$start, A$stop, B$start, B$stop))); U = data.table(start=head(b, -1), stop=tail(b, -1)); U[A, on=.(start < stop, stop > start), covar_A := i.covar_A, by=.EACHI][B, on=.(start < stop, stop > start), covar_B := i.covar_B, by=.EACHI][]? And intersection is just filtering for non-NA? Commented Oct 27 at 20:22
  • 2
    Alex, any progress on your question? You have several suggested answers, it's not clear if you have resolution, if you still have questions, or otherwise. Please don't abandon a question with answers without accepting one. Commented Nov 6 at 12:47

5 Answers 5

2

Note that the solutions in this answer assume data grouped by a patient ID, as mentioned in the question (but not in the example given).

The key is your comment:

By union, I mean all distinct time segments defined by any start or stop boundary from either A or B.

i.e. every date in the four corresponding columns starts a new spell in the "union" (a new {A state, B state} pair), and there will be 2*nrow(A) + 2*nrow(B) such spells (including a censored post-stop spell at the end of each patient's history). A simple, scalable, extremely fast solution is therefore: stack |> melt |> sort |> fill A,B states (all grouped by, or ideally vectorised over, patients).

(On their own, the "intersections" between A and B can be done even more simply as an inner overlapping-range join, selecting the later of the two starts and earlier of the two stops at each joined row.)

Here are timings on data at the scale described in the question (A and B each with 100K rows, 10K distinct patients). The union takes about 1/8 of a second and the intersections about 1/30.

library(data.table)

# Full-scale data: A and B 100K rows each, 10K distinct patients
args <- list(n_i=1e4L, n_t=10)
idx_t <- rep(1:args$n_t, args$n_i); 
set.seed(1)
A <- do.call(gen_spells, args)[, covar_A := paste("Pt int", idx_t)][]
B <- do.call(gen_spells, args)[, covar_B := paste("Drug", idx_t)][]
> print(bm, unit="seconds",signif=3)
Unit: seconds
                 expr    min     lq   mean median     uq   max neval
     union_fill(A, B) 0.3960 0.4170 0.4360 0.4290 0.4430 0.558   100
 union_fill_zoo(A, B) 0.1040 0.1110 0.1380 0.1180 0.1250 0.235   100
     union_join(A, B) 0.1310 0.1440 0.1570 0.1470 0.1540 0.265   100
  intersections(A, B) 0.0273 0.0293 0.0369 0.0314 0.0378 0.149   100

Code for gen_spells() is at the bottom of this post.

I spent more time than was healthy on the details of the fill step. For each start in A we need to fill covar_A forward to one row before its corresponding stop, and fill with NA from that stop to the next start (and the same with covar_B, while respecting the boundaries between patient histories). The variants above are:

  • union_fill(): replace with covar[1L] or NA, grouped by the appropriate runs
  • union_fill_zoo(): replace covar at each stop with a sentinel string and forward fill (using zoo::na.locf(), since data.table::na.fill() can't fill character vectors)
  • union_join(): don't carry the covars into the melt and just use a join to populate them, instead of a fill

I think union_join() is the most elegant (and union_fill_zoo() the least!). I didn't realise until I looked at timings for other solutions (see below) but @marcguery's answer does the same thing.

NB it would be possible to extend these solutions efficiently to more than two tables as some of the other solutions do.

Solution code

I've used column names "from" and "to" in the output. I've given union_fill() a slightly different output: it keeps one transition per row (so has zero-length spells when there are simultaneous events in A and B), and tags the transition. union_join() doesn't have this (it actually can't), but you probably don't want it anyway.

install.packages('fjoin', repos = c('https://trobx.r-universe.dev'))
# A CRAN-ready general data frame join pkg released last week
# It autogenerates and runs extended data.table code (and does some no-copy
# object-handling around the edges to accept/return non-data.tables)
union_fill <- function(A, B) {
  # one transition per row (hence zero-length spells if simultaneous)

  library(data.table)

  for (dt in list(A,B)) setkeyv(dt, c("id","start","stop"))

  # stack A and B
  X <- rbind(A=A, B=B, fill=TRUE, idcol="which_table")

  # melt starts and stops
  X <- melt(X, id.vars = c("id", "covar_A", "covar_B", "which_table"),
            measure.vars = c("start", "stop"),
            value.name = "from",
            variable.name = "which_column", variable.factor = FALSE)

  # sort
  setkey(X, id, from)

  # fill A state, fill B state
  X[, is_stop := which_column=="stop"]
  X[, covar_A := if(is_stop[1L]) NA_character_ else covar_A[1L], by=.(id, cumsum(which_table=="A"))]
  X[, covar_B := if(is_stop[1L]) NA_character_ else covar_B[1L], by=.(id, cumsum(which_table=="B"))]
  X[, is_stop := NULL]

  # add "to" column, drop final post-stop spell per history
  X[, to := fifelse(id == shift(id, type="lead"), shift(from, type="lead"), NA)]
  X <- X[id==shift(id, type="lead")]
  setcolorder(X, c("id","from","to","covar_A","covar_B","which_table","which_column"))[]
}
union_join <- function(A, B) {
  
  library(data.table)
  library(fjoin)
  
  for (dt in list(A,B)) setkeyv(dt, c("id","start","stop"))
  
  # stack A and B key cols (no covars)
  X <- rbind(A[,.SD,.SDcols=key(A)], B[,.SD,.SDcols=key(B)])

  # melt starts and stops
  X <- melt(X, id.vars = "id",
            measure.vars = c("start", "stop"),
            value.name = "from",
            variable.factor = FALSE) |>
       _[, variable := NULL]

  # sort, deduplicate, add "to" column, drop final post-stop spell per history
  X <- unique(setkey(X))
  X[, to := fifelse(id == shift(id, type="lead"), shift(from, type="lead"), NA)]
  X <- X[id==shift(id,type="lead")]

  # populate the states using M:1 open-overlap joins
  setkey(X, id, from, to)
  X <- fjoin_left(X, A, on=c("id", "from < stop", "to > start"), select = "covar_A")
  X <- fjoin_left(X, B, on=c("id", "from < stop", "to > start"), select = c("covar_A", "covar_B"))
  X[, c("R.start","R.stop") := NULL][]
}
intersections <- function(A, B) {
  
  library(data.table)
  library(fjoin)

  # inner overlap join, then take the later start and earlier stop 
  X <- fjoin_inner(A, B, on=c("id", "start <= stop", "stop >= start"))
  X[, `:=`(from=pmax(start, R.start), to=pmin(stop, R.stop))]
  
  # cosmetics
  setcolorder(X, c("id","from","to","covar_A","covar_B"))[]
}

Demo on toy data

For realism/checking, this toy data has:

  • a Patient ID column
  • Extended overlaps on each side
  • An empty spell in the middle
  • A tied stop and a tied start
  • A Patient 2 to demonstrate that things work across the boundary between patients
A_toy <- fread("
id      start       stop   covar_A
 1 2025-01-01 2025-01-27 'Pt int 1'
 1 2025-02-11 2025-02-17 'Pt int 2'
 1 2025-02-21 2025-03-01 'Pt int 3'
 1 2025-03-06 2025-03-13 'Pt int 4'
", quote="'")

B_toy <- fread("
id      start       stop covar_B
 1 2025-01-05 2025-01-10 'Drug 1'
 1 2025-01-25 2025-01-30 'Drug 2'
 1 2025-02-04 2025-03-01 'Drug 3'
 1 2025-03-06 2025-03-11 'Drug 4'
 2 2025-01-03 2025-01-10 'Drug 1'
", quote="'")

setkey(A_toy, id, start, stop)
setkey(B_toy, id, start, stop)

Checking the "union" solutions:

# compare union (full natural join with indicator)
fjoin_full(
  union_fill(A_toy,B_toy),
  union_join(A_toy,B_toy),
  on=NA,         # natural join
  match.na=TRUE, # fjoin is NA-safe by default but we *want* NA matches here
  indicate=TRUE  # Stata-style indicator (1 left only, 2 right only, 3 joined)
)

    .join    id       from         to  covar_A covar_B which_table which_column
    <int> <int>     <IDat>     <IDat>   <char>  <char>      <char>       <char>
 1:     3     1 2025-01-01 2025-01-05 Pt int 1    <NA>           A        start
 2:     3     1 2025-01-05 2025-01-10 Pt int 1  Drug 1           B        start
 3:     3     1 2025-01-10 2025-01-25 Pt int 1    <NA>           B         stop
 4:     3     1 2025-01-25 2025-01-27 Pt int 1  Drug 2           B        start
 5:     3     1 2025-01-27 2025-01-30     <NA>  Drug 2           A         stop
 6:     3     1 2025-01-30 2025-02-04     <NA>    <NA>           B         stop
 7:     3     1 2025-02-04 2025-02-11     <NA>  Drug 3           B        start
 8:     3     1 2025-02-11 2025-02-17 Pt int 2  Drug 3           A        start
 9:     3     1 2025-02-17 2025-02-21     <NA>  Drug 3           A         stop
10:     3     1 2025-02-21 2025-03-01 Pt int 3  Drug 3           A        start
11:     1     1 2025-03-01 2025-03-01     <NA>  Drug 3           A         stop
12:     3     1 2025-03-01 2025-03-06     <NA>    <NA>           B         stop
13:     1     1 2025-03-06 2025-03-06 Pt int 4    <NA>           A        start
14:     3     1 2025-03-06 2025-03-11 Pt int 4  Drug 4           B        start
15:     3     1 2025-03-11 2025-03-13 Pt int 4    <NA>           B         stop
16:     3     2 2025-01-03 2025-01-10     <NA>  Drug 1           B        start

Note that union_fill() has extra columns which_table and which_column and two extra rows (11 and 13).

"Intersections":

# check intersections
fjoin_full(
  intersections(A_toy,B_toy),
  union_join(A_toy,B_toy)[!(is.na(covar_A) | is.na(covar_B))],
  on=NA,
  select.y = "", # leave out the "which_" cols
  indicate=TRUE
)

   .join    id       from         to  covar_A covar_B      start       stop    R.start     R.stop
   <int> <int>     <IDat>     <IDat>   <char>  <char>     <IDat>     <IDat>     <IDat>     <IDat>
1:     3     1 2025-01-05 2025-01-10 Pt int 1  Drug 1 2025-01-01 2025-01-27 2025-01-05 2025-01-10
2:     3     1 2025-01-25 2025-01-27 Pt int 1  Drug 2 2025-01-01 2025-01-27 2025-01-25 2025-01-30
3:     3     1 2025-02-11 2025-02-17 Pt int 2  Drug 3 2025-02-11 2025-02-17 2025-02-04 2025-03-01
4:     3     1 2025-02-21 2025-03-01 Pt int 3  Drug 3 2025-02-21 2025-03-01 2025-02-04 2025-03-01
5:     3     1 2025-03-06 2025-03-11 Pt int 4  Drug 4 2025-03-06 2025-03-13 2025-03-06 2025-03-11

Other answers

The other answers all also give the same correct output on the toy data (@Wimpel's doesn't include the empty spells (which they could easily add), and defines contiguous spells like 1-4,5-7).

None of them incorporate multiple patients, but a dirty method of doing this is to wrap them in by.

# 1000 histories of length 10
args <- list(n_i=1000, n_t=10)
idx_t <- rep(1:args$n_t, args$n_i); set.seed(1)
A <- do.call(gen_spells, args)[, covar_A := paste("Pt int", idx_t)][]
B <- do.call(gen_spells, args)[, covar_B := paste("Drug", idx_t)][]
for (x in list(A2 <- copy(A), B2 <- copy(B))) x[, names(.SD) := lapply(.SD, as.Date), .SDcols=c("start","stop")]

do_by_id <- function(f,x,y) x[, f(.SD,y[id==.BY,.SD,.SDcols=-"id"]), by=id]

> print(rbind(bm3a,bm3b), unit="seconds", signif=3)
Unit: seconds
                        expr      min       lq     mean   median       uq      max neval
            union_fill(A, B)   0.0526   0.0536   0.0583   0.0565   0.0604   0.0724    10
            union_join(A, B)   0.0307   0.0335   0.0350   0.0356   0.0370   0.0376    10
 do_by_id(marcguery, A2, B2)  12.2000  13.1000  13.4000  13.5000  13.8000  14.3000    10
      do_by_id(Wimpel, A, B)  10.1000  11.2000  11.6000  11.8000  12.2000  12.2000    10
     do_by_id(r2evans, A, B) 112.0000 112.0000 112.0000 112.0000 112.0000 112.0000     1
     do_by_id(deschen, A, B) 168.0000 168.0000 168.0000 168.0000 168.0000 168.0000     1

Please note well that the timings of the other solutions would be far faster if they were specifically adapted to cope with multiple patients, rather than being bunged into a general split-apply-combine setup. But even doing it this way, the worst case is ~30 mins on the full-scale data (10K such histories), so all of them are feasible on the task at hand, which is not time-sensitive (@deschen's point).

> all.equal(do_by_id(deschen,A,B), union_join(A,B),
            use.names=FALSE, check.attributes=FALSE)
TRUE

Overlap joins

Part of your question refers to data.table's foverlaps() function. This was intended for asymmetric data typical in genomics. However, in my experience/playing around (which is not genomics) it is very often outperformed by an inequality join using standard data.table join machinery. (Note that inequality joins in data.table came after foverlaps().)

?foverlaps explains how the different overlap types translate to inequalities. It takes seeing, but A and B overlap iff start_A <= stop_B & stop_A >= start_B (where starting and stopping on the same day counts as overlapping, hence the weak inequalities). I find it easier to see this as !(start_A > stop_B | start_B > stop_A), i.e. they don't not overlap. But that doesn't work as a join predicate in data.table!

I've used a data.table inequality join in intersections(), which is several times faster than foverlaps(), though they are both basically instantaneous:

# intersections - overlap join step only (full-scale data)
data.table_inequality <- \(A,B) fjoin_inner(A,B,on=c("id","start <= stop","stop >= start"))
> print(bm_foverlaps, unit="seconds",signif=3)
Unit: seconds
                        expr    min     lq   mean median     uq   max neval
 data.table_inequality(A, B) 0.0209 0.0221 0.0265 0.0228 0.0234 0.193   100
             foverlaps(A, B) 0.0561 0.0586 0.0762 0.0631 0.0787 0.254   100

foverlaps() requires keyed data and as a result it is very short and sweet - you typically don't need to specify the join columns. But there are times when you can't use it, e.g. because you need a strict inequality. That's actually the case in union_join(). In that case you have to patch things up post hoc (as in @marcguery's code), or write a data.table inequality join directly, which is notoriously painful because of its garbling of join column outputs (which is logical and systematic, but inconvenient). fjoin removes the pain because it autogenerates the data.table code for you. You can view this code instead of executing it, with do=FALSE. Going back to intersections() to illustrate:

> fjoin::fjoin_inner(
+   A, B, on=c("id", "start<=stop", "stop>=start"),
+   select=c("covar_A","covar_B"),
+   do=FALSE
+ )
.DT : y = B
.i  : x = A
Join: setDT(.DT[.i, on = c("id", "stop >= start", "start <= stop"), nomatch = NULL,
data.frame(id = i.id, start = i.start, stop = i.stop, R.start = x.start, R.stop = x.stop,
covar_A, covar_B)])[]

(I've also added a select arg here in case you have irrelevant columns in your real-life A and B.) You could now swipe that data.table code and edit the j-expression to make the whole intersections solution a true one-liner, partly "ghostwritten" by fjoin:

setDT(
  B[A,
    on = c("id", "stop >= start", "start <= stop"),
    nomatch = NULL,
    data.frame(
      id = i.id,
      from = pmax(i.start, x.start), to = pmax(i.stop, x.stop),
      covar_A, covar_B,
      start_A = i.start, stop_A = i.stop,
      start_B = x.start, stop_B = x.stop
    )]
)[]

You might notice that fjoin avoids j=list(). This is because that deep-copies the vectors, which is unnecessary when they are fresh outputs of a join. But enough about fjoin, except to say that this is its first outing, and if you do choose to use it, you will be its patient zero!

(Code for artificial data)

gen_spells <- function(
  n_i=1e4L,    # patients
  n_t=10L,     # spells per patient
  off_max=60L, # max gap between spells
  on_max=180L, # max duration of spells
  offset=0L,   # offset from base
  base=as.IDate("2020-01-01")
) {
  dt <- data.table(
    id=rep(1:n_i,each=n_t),
    off=sample(off_max, n_i*n_t, TRUE),
    on=sample(on_max, n_i*n_t, TRUE)
  )
  dt[, stop:=base-1L+cumsum(off+on), keyby=id][, start:=stop-on]
  kcols <- c("id","start","stop")
  setcolorder(dt[, c("off", "on") := NULL], kcols)
  setkeyv(dt, kcols)[]
}
Sign up to request clarification or add additional context in comments.

5 Comments

+1 for patient zero, you've been warned. And for those unfamiliar, like me, could you expand on freshness or conversely when something is stale.
Thank you for this @Chris. Although it is more idiomatic, .() in j calls as.data.table.list, which always deep-copies the vectors. But this copy is redundant after a join because the vectors are freshly allocated and hence unshared (at least, they are in data.table -- it seems they aren't always in collapse as I discovered last week). Using data.frame() in j bypasses that copy under ordinary copy-on-write rules, and it can be followed if necessary by setDT(), as_tibble(), or st_sf() (plus a careful refresh of attributes).
fjoin does something similar on the way in when one or both inputs are not data.tables: since the input columns only need to be read, fjoin can get away with "shallow-casting" to data.table using e.g. setDT(unclass(df)), i.e. a new object that shares the existing columns. Btw I owe a lot of this understanding to an extremely helpful response from the data.table devs here: github.com/Rdatatable/data.table/issues/6843.
Interesting and useful enough to save for later, thanks. Will be interested to see you advocating additional use cases as they come up.
Thank you very much for this detailed and clear answer. I hadn’t expected such depth, it took me some time to digest everything! I’m glad to be patient zero, and I’ll do my best to help propagate the "fjoin" epidemic.
2

Using data.table and dplyr:

library(data.table)

#### 1. Generate all possible date ranges ####
A_B <- rbind(A[,c(1,2)], B[,c(1,2)])
all_dates <- data.table(start = sort(unique(c(A_B$start, A_B$stop))),
                       stop = dplyr::lead(sort(unique(c(A_B$start, A_B$stop))), 1))
all_dates <- na.omit(all_dates)

####  2. Overlap each data table with the full set of date ranges ####
setkey(A, start, stop)
setkey(B, start, stop)
setkey(all_dates, start, stop)

overlaps_A <- foverlaps(A, all_dates, type = "any")
# Special case when 2 intervals (say x and y) should not 
# overlap when 'start_x == stop_y' or  'start_y == stop_x'
overlaps_A <- overlaps_A[overlaps_A$i.start <  overlaps_A$stop,]
overlaps_A <- overlaps_A[overlaps_A$start <  overlaps_A$i.stop,]

overlaps_B <- foverlaps(B, all_dates, type = "any")
# Special case when 2 intervals (say x and y) should not 
# overlap when 'start_x == stop_y' or  'start_y == stop_x'
overlaps_B <- overlaps_B[overlaps_B$i.start <  overlaps_B$stop,]
overlaps_B <- overlaps_B[overlaps_B$start <  overlaps_B$i.stop,]

####  3. Merge results ####

## A | B
overlaps_A_or_B <- merge.data.table(overlaps_A[,-c(3,4)], overlaps_B[,-c(3,4)],
            by = c("start", "stop"), all = T)
#        start       stop covar_A covar_B
# 1 2022-10-16 2023-01-12 P-int 1  Drug 1
# 2 2023-01-12 2023-03-12 P-int 1    <NA>
# 3 2023-06-16 2023-08-16 P-int 2    <NA>
# 4 2023-08-16 2023-10-12 P-int 2  Drug 2
# 5 2023-10-12 2023-12-12    <NA>  Drug 2

## A & B
overlaps_A_and_B <- na.omit(overlaps_A_or_B)
#        start       stop covar_A covar_B
# 1 2022-10-16 2023-01-12 P-int 1  Drug 1
# 2 2023-08-16 2023-10-12 P-int 2  Drug 2

## A | B + !(A | B)
all_dates <- merge.data.table(all_dates, overlaps_A_or_B,
            by = c("start", "stop"), all = T)

#        start       stop covar_A covar_B
# 1 2022-10-16 2023-01-12 P-int 1  Drug 1
# 2 2023-01-12 2023-03-12 P-int 1    <NA>
# 3 2023-03-12 2023-06-16    <NA>    <NA>
# 4 2023-06-16 2023-08-16 P-int 2    <NA>
# 5 2023-08-16 2023-10-12 P-int 2  Drug 2
# 6 2023-10-12 2023-12-12    <NA>  Drug 2

Raw data

A <- data.table(
  start = as.Date(c("2022-10-16", "2023-06-16")),
  stop  = as.Date(c("2023-03-12", "2023-10-12")),
  covar_A = c("P-int 1", "P-int 2")
)
#        start       stop covar_A
# 1 2022-10-16 2023-03-12 P-int 1
# 2 2023-06-16 2023-10-12 P-int 2

B <- data.table(
  start = as.Date(c("2022-10-16", "2023-08-16")),
  stop  = as.Date(c("2023-01-12", "2023-12-12")),
  covar_B = c("Drug 1", "Drug 2")
)
#        start       stop covar_B
# 1 2022-10-16 2023-01-12  Drug 1
# 2 2023-08-16 2023-12-12  Drug 2

2 Comments

Thank you very much for your answer, it is very precise! I’m starting to get a better understanding of the philosophy behind foverlaps.
It can be done a bit more tightly (see my answer) but +1 for the use of overlap joins to populate the covars
1

I first thought this would be an extremely tricky task, but then realized it could be done with a relatively simple loop.

Note: there might be a more elegant solution that doesn't require a loop and instead uses rowwise directly in the results data frame or some other magic. But for the time being and for the sake of showing step by step what we are doing I created a for-loop.

Note2: even though you created a data.table input, I assume another framework (i.e. tidyverse) is fine. This is because I've no idea about data.table. Assuming that you are probably conducting a clinical study, your data most likely won't be considered as "big", so performance differences between both frameworks would lie in the magnitude of milliseconds and hence could be ignored.


So what are we doing?

  1. We get all unique dates from both data A and B (start and stop columns) and bring them in chronological order.
  2. We create the new results data shell with these unique dates as a start dates and their next-bigger neighbours as stop dates.
  3. We loop through each row of our results data and check in both data sets A and B whether the start date from the results data frame is >= the start date in A/B and whether the stop date from the results data frame is <= the stop date in A/B.
  4. We pull the respective covar_A/B values and write them to the results data.

library(tidyverse)

# Input data
A <- data.frame(start = as.Date(c("2022-10-16", "2023-06-16")),
                stop  = as.Date(c("2023-03-12", "2023-10-12")),
                covar_A = c("P-int 1", "P-int 2"))

B <- data.frame(start = as.Date(c("2022-10-16", "2023-08-16")),
                stop  = as.Date(c("2023-01-12", "2023-12-12")),
                covar_B = c("Drug 1", "Drug 2"))

# Get all dates from the data and sort in ascending order
unique_dates <- sort(unique(c(A$start, A$stop, B$start, B$stop)))

# create empty results object
results <- data.frame(Start   = unique_dates,
                      Stop    = lead(unique_dates),
                      covar_A = NA,
                      covar_B = NA) |> 
  filter(!is.na(Start) & !is.na(Stop))

# loop through all rows of results
for (i in 1:nrow(results))
{
  # filter on cases in data A and B that fall into the respective loop interval
  match_A <- A |> 
    filter(results$Start[i] >= start & results$Stop[i] <= stop) |> 
    pull(covar_A) |> 
    (\(x) if (length(x) == 0) NA else x)()
  
  match_B <- B |> 
    filter(results$Start[i] >= start & results$Stop[i] <= stop) |> 
    pull(covar_B) |> 
    (\(x) if (length(x) == 0) NA else x)()
  
  # Overwrite empty covar_A/B values with values from each loop iteration
  results <- results |> 
    mutate(covar_A = if_else(row_number() == !!i, match_A, covar_A),
           covar_B = if_else(row_number() == !!i, match_B, covar_B))
}

       Start       Stop covar_A covar_B
1 2022-10-16 2023-01-12 P-int 1  Drug 1
2 2023-01-12 2023-03-12 P-int 1    <NA>
3 2023-03-12 2023-06-16    <NA>    <NA>
4 2023-06-16 2023-08-16 P-int 2    <NA>
5 2023-08-16 2023-10-12 P-int 2  Drug 2
6 2023-10-12 2023-12-12    <NA>  Drug 2

The intersection case then is just a sub-case of the above where both covariates are non-missing:

results |>
  filter(!is.na(covar_A) & !is.na(covar_B))

       Start       Stop covar_A covar_B
1 2022-10-16 2023-01-12 P-int 1  Drug 1
2 2023-08-16 2023-10-12 P-int 2  Drug 2

6 Comments

Interesting to see a downvote here without any explanation why. @Alex: in case the downvote is from you and teh solution doesn't solve your problem, please provide more details.
I also don't understand the downvote — it didn’t come from me. Thank you very much for your detailed answer! I haven’t had the time to go through it thoroughly yet, and I’ll test it this week. I used a data.table object because I thought foverlaps() was specifically designed for this kind of manipulation, but I couldn’t make it work properly — so you’re right, a data.table isn’t strictly necessary. For context, I’m working with healthcare administrative data: about ~100k segments across ~10k subjects.
(not my dv) I suspect it's because the author explicitly requested data.table and this is not within that scope, I've seen similar downvoting elsewhere that I infer is for similar reasons. I think Alex's statement here about the presumption of foverlaps is absolutely relevant to the question, it would be useful if the OP added that text. While I have no expectation that this downvoter will revisit and un-dv, perhaps it'll realign future voters' views. Alex, could you do that? I think it'd be helpful.
Thanks for pointing that out. Interesting perspective. I personally don't see a specific request to use data.table. There's just one reference that data.table::foverlaps could help, which I personally interpret as "Hey, here's a hint, maybe it helps. But if you can come up with a different solution, I'm fine with that either.".
What would speak against this theory is that also the data.table solution from marcguery got downvoted. But anyway, that's not an important enough first-world problem.
Good point deschen, I have no idea. BTW, I believe seeing data.table in the absence of dplyr plus sample data explicitly in data.table(.) to suggest requesting a data.table solution. Perhaps that just an over-eager inference.
1

I believe this is getting very close.. could (perhaps) use some tidying and leaves out 'empty' periods where all covars are NA.

Should easily epand to multipe data.tables A, B, C, ...., N

library(dataMojo) # for the row_expand_dates function
# create a list of all input data.tables
L <- list(A, B)
# expand the input data.tabels to one day per row
L2 <- lapply(L, function(x) {
  temp <- row_expand_dates(x, "start", "stop", "date")
  setkey(temp, date)  # set key for joining in next step
  temp[, 3:4] #keep only relevant columns
})
# merge all data.tablesfrom list L2 to a single DT
DT = Reduce(function(...) merge(..., all = TRUE), L2)
# now creating combined intervals agains is easy using rleidv
DT[, interval := rleidv(DT, 2:ncol(DT))]

# OR solution (only keep the first row of each interval (SD[1]))
final <- DT[, end := max(date), by = interval][, .SD[1], by = interval]
#    interval       date covar_A covar_B        end
#       <int>     <Date>  <char>  <char>     <Date>
# 1:        1 2022-10-16 P-int 1  Drug 1 2023-01-12
# 2:        2 2023-01-13 P-int 1    <NA> 2023-03-12
# 3:        3 2023-06-16 P-int 2    <NA> 2023-08-15
# 4:        4 2023-08-16 P-int 2  Drug 2 2023-10-12
# 5:        5 2023-10-13    <NA>  Drug 2 2023-12-12

# AND solution (only keep rows where no covar is missing)
final[complete.cases(final), ]
#    interval       date covar_A covar_B        end
#       <int>     <Date>  <char>  <char>     <Date>
# 1:        1 2022-10-16 P-int 1  Drug 1 2023-01-12
# 2:        4 2023-08-16 P-int 2  Drug 2 2023-10-12

Comments

1

Here's a native-data.table approach.

library(data.table)
alldates <- sort(unique(c(A[, c(start, stop)], B[, c(start, stop)])))
X <- list(
  A, B,
  data.table(start = alldates[-length(alldates)], stop = alldates[-1])
) |>
  lapply(function(x) {
    x[, {
      newdates <- unique(c(start, alldates[ between(alldates, start, stop) ], stop))
      .(st = newdates[-length(newdates)], sp = newdates[-1])
    },  by = names(x)] |>
      _[, c("start", "stop", "st", "sp") := .(st, sp, NULL, NULL)]
  }) |>
  Reduce(x = _, f = function(a, b) merge(a, b, by = c("start", "stop"), all = TRUE)) |>
  setorder(start, stop)

The whole X is the union, and the complete-rows (no-NA) is the intersection:

X
# Key: <start, stop>
#         start       stop covar_A covar_B
#        <Date>     <Date>  <char>  <char>
# 1: 2022-10-16 2023-01-12 P-int 1  Drug 1
# 2: 2023-01-12 2023-03-12 P-int 1    <NA>
# 3: 2023-03-12 2023-06-16    <NA>    <NA>
# 4: 2023-06-16 2023-08-16 P-int 2    <NA>
# 5: 2023-08-16 2023-10-12 P-int 2  Drug 2
# 6: 2023-10-12 2023-12-12    <NA>  Drug 2

X[complete.cases(X),]
# Key: <start, stop>
#         start       stop covar_A covar_B
#        <Date>     <Date>  <char>  <char>
# 1: 2022-10-16 2023-01-12 P-int 1  Drug 1
# 2: 2023-08-16 2023-10-12 P-int 2  Drug 2

There are two reasons I explicitly used a list and lapply over it:

  • To demonstrate that it can work on two tables A and B or as many tables as you want. All columns other than start/stop should be unique or their names will be suffixed (".x", ".y") by the merge operation.
  • To add the literal data.table(start=,stop=) that is needed to bring in the "gap" (row 3 with neither covariate from this data). If you don't truly need the gap, just omit this third element and you'll get a 5-row union where every row has at least one covariate.

There is no reason this cannot be adapted to dplyr or even base R if you determine that data.table is not required (since you said the primary rationale for requesting data.table solutions was because you believed foverlaps to be a useful or necessary component).


Data

A <- data.table::as.data.table(structure(list(start = structure(c(19281, 19524), class = "Date"), stop = structure(c(19428, 19642), class = "Date"), covar_A = c("P-int 1", "P-int 2")), row.names = c(NA, -2L), class = c("data.table", "data.frame")))
B <- data.table::as.data.table(structure(list(start = structure(c(19281, 19585), class = "Date"), stop = structure(c(19369, 19703), class = "Date"), covar_B = c("Drug 1", "Drug 2")), row.names = c(NA, -2L), class = c("data.table", "data.frame")))

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.