Note that the solutions in this answer assume data grouped by a patient ID, as mentioned in the question (but not in the example given).
The key is your comment:
By union, I mean all distinct time segments defined by any start or stop boundary from either A or B.
i.e. every date in the four corresponding columns starts a new spell in the "union" (a new {A state, B state} pair), and there will be 2*nrow(A) + 2*nrow(B) such spells (including a censored post-stop spell at the end of each patient's history). A simple, scalable, extremely fast solution is therefore: stack |> melt |> sort |> fill A,B states (all grouped by, or ideally vectorised over, patients).
(On their own, the "intersections" between A and B can be done even more simply as an inner overlapping-range join, selecting the later of the two starts and earlier of the two stops at each joined row.)
Here are timings on data at the scale described in the question (A and B each with 100K rows, 10K distinct patients). The union takes about 1/8 of a second and the intersections about 1/30.
library(data.table)
# Full-scale data: A and B 100K rows each, 10K distinct patients
args <- list(n_i=1e4L, n_t=10)
idx_t <- rep(1:args$n_t, args$n_i);
set.seed(1)
A <- do.call(gen_spells, args)[, covar_A := paste("Pt int", idx_t)][]
B <- do.call(gen_spells, args)[, covar_B := paste("Drug", idx_t)][]
> print(bm, unit="seconds",signif=3)
Unit: seconds
expr min lq mean median uq max neval
union_fill(A, B) 0.3960 0.4170 0.4360 0.4290 0.4430 0.558 100
union_fill_zoo(A, B) 0.1040 0.1110 0.1380 0.1180 0.1250 0.235 100
union_join(A, B) 0.1310 0.1440 0.1570 0.1470 0.1540 0.265 100
intersections(A, B) 0.0273 0.0293 0.0369 0.0314 0.0378 0.149 100
Code for gen_spells() is at the bottom of this post.
I spent more time than was healthy on the details of the fill step. For each start in A we need to fill covar_A forward to one row before its corresponding stop, and fill with NA from that stop to the next start (and the same with covar_B, while respecting the boundaries between patient histories). The variants above are:
union_fill(): replace with covar[1L] or NA, grouped by the appropriate runs
union_fill_zoo(): replace covar at each stop with a sentinel string and forward fill (using zoo::na.locf(), since data.table::na.fill() can't fill character vectors)
union_join(): don't carry the covars into the melt and just use a join to populate them, instead of a fill
I think union_join() is the most elegant (and union_fill_zoo() the least!). I didn't realise until I looked at timings for other solutions (see below) but @marcguery's answer does the same thing.
NB it would be possible to extend these solutions efficiently to more than two tables as some of the other solutions do.
Solution code
I've used column names "from" and "to" in the output. I've given union_fill() a slightly different output: it keeps one transition per row (so has zero-length spells when there are simultaneous events in A and B), and tags the transition. union_join() doesn't have this (it actually can't), but you probably don't want it anyway.
install.packages('fjoin', repos = c('https://trobx.r-universe.dev'))
# A CRAN-ready general data frame join pkg released last week
# It autogenerates and runs extended data.table code (and does some no-copy
# object-handling around the edges to accept/return non-data.tables)
union_fill <- function(A, B) {
# one transition per row (hence zero-length spells if simultaneous)
library(data.table)
for (dt in list(A,B)) setkeyv(dt, c("id","start","stop"))
# stack A and B
X <- rbind(A=A, B=B, fill=TRUE, idcol="which_table")
# melt starts and stops
X <- melt(X, id.vars = c("id", "covar_A", "covar_B", "which_table"),
measure.vars = c("start", "stop"),
value.name = "from",
variable.name = "which_column", variable.factor = FALSE)
# sort
setkey(X, id, from)
# fill A state, fill B state
X[, is_stop := which_column=="stop"]
X[, covar_A := if(is_stop[1L]) NA_character_ else covar_A[1L], by=.(id, cumsum(which_table=="A"))]
X[, covar_B := if(is_stop[1L]) NA_character_ else covar_B[1L], by=.(id, cumsum(which_table=="B"))]
X[, is_stop := NULL]
# add "to" column, drop final post-stop spell per history
X[, to := fifelse(id == shift(id, type="lead"), shift(from, type="lead"), NA)]
X <- X[id==shift(id, type="lead")]
setcolorder(X, c("id","from","to","covar_A","covar_B","which_table","which_column"))[]
}
union_join <- function(A, B) {
library(data.table)
library(fjoin)
for (dt in list(A,B)) setkeyv(dt, c("id","start","stop"))
# stack A and B key cols (no covars)
X <- rbind(A[,.SD,.SDcols=key(A)], B[,.SD,.SDcols=key(B)])
# melt starts and stops
X <- melt(X, id.vars = "id",
measure.vars = c("start", "stop"),
value.name = "from",
variable.factor = FALSE) |>
_[, variable := NULL]
# sort, deduplicate, add "to" column, drop final post-stop spell per history
X <- unique(setkey(X))
X[, to := fifelse(id == shift(id, type="lead"), shift(from, type="lead"), NA)]
X <- X[id==shift(id,type="lead")]
# populate the states using M:1 open-overlap joins
setkey(X, id, from, to)
X <- fjoin_left(X, A, on=c("id", "from < stop", "to > start"), select = "covar_A")
X <- fjoin_left(X, B, on=c("id", "from < stop", "to > start"), select = c("covar_A", "covar_B"))
X[, c("R.start","R.stop") := NULL][]
}
intersections <- function(A, B) {
library(data.table)
library(fjoin)
# inner overlap join, then take the later start and earlier stop
X <- fjoin_inner(A, B, on=c("id", "start <= stop", "stop >= start"))
X[, `:=`(from=pmax(start, R.start), to=pmin(stop, R.stop))]
# cosmetics
setcolorder(X, c("id","from","to","covar_A","covar_B"))[]
}
Demo on toy data
For realism/checking, this toy data has:
- a Patient ID column
- Extended overlaps on each side
- An empty spell in the middle
- A tied stop and a tied start
- A Patient 2 to demonstrate that things work across the boundary between patients
A_toy <- fread("
id start stop covar_A
1 2025-01-01 2025-01-27 'Pt int 1'
1 2025-02-11 2025-02-17 'Pt int 2'
1 2025-02-21 2025-03-01 'Pt int 3'
1 2025-03-06 2025-03-13 'Pt int 4'
", quote="'")
B_toy <- fread("
id start stop covar_B
1 2025-01-05 2025-01-10 'Drug 1'
1 2025-01-25 2025-01-30 'Drug 2'
1 2025-02-04 2025-03-01 'Drug 3'
1 2025-03-06 2025-03-11 'Drug 4'
2 2025-01-03 2025-01-10 'Drug 1'
", quote="'")
setkey(A_toy, id, start, stop)
setkey(B_toy, id, start, stop)
Checking the "union" solutions:
# compare union (full natural join with indicator)
fjoin_full(
union_fill(A_toy,B_toy),
union_join(A_toy,B_toy),
on=NA, # natural join
match.na=TRUE, # fjoin is NA-safe by default but we *want* NA matches here
indicate=TRUE # Stata-style indicator (1 left only, 2 right only, 3 joined)
)
.join id from to covar_A covar_B which_table which_column
<int> <int> <IDat> <IDat> <char> <char> <char> <char>
1: 3 1 2025-01-01 2025-01-05 Pt int 1 <NA> A start
2: 3 1 2025-01-05 2025-01-10 Pt int 1 Drug 1 B start
3: 3 1 2025-01-10 2025-01-25 Pt int 1 <NA> B stop
4: 3 1 2025-01-25 2025-01-27 Pt int 1 Drug 2 B start
5: 3 1 2025-01-27 2025-01-30 <NA> Drug 2 A stop
6: 3 1 2025-01-30 2025-02-04 <NA> <NA> B stop
7: 3 1 2025-02-04 2025-02-11 <NA> Drug 3 B start
8: 3 1 2025-02-11 2025-02-17 Pt int 2 Drug 3 A start
9: 3 1 2025-02-17 2025-02-21 <NA> Drug 3 A stop
10: 3 1 2025-02-21 2025-03-01 Pt int 3 Drug 3 A start
11: 1 1 2025-03-01 2025-03-01 <NA> Drug 3 A stop
12: 3 1 2025-03-01 2025-03-06 <NA> <NA> B stop
13: 1 1 2025-03-06 2025-03-06 Pt int 4 <NA> A start
14: 3 1 2025-03-06 2025-03-11 Pt int 4 Drug 4 B start
15: 3 1 2025-03-11 2025-03-13 Pt int 4 <NA> B stop
16: 3 2 2025-01-03 2025-01-10 <NA> Drug 1 B start
Note that union_fill() has extra columns which_table and which_column and two extra rows (11 and 13).
"Intersections":
# check intersections
fjoin_full(
intersections(A_toy,B_toy),
union_join(A_toy,B_toy)[!(is.na(covar_A) | is.na(covar_B))],
on=NA,
select.y = "", # leave out the "which_" cols
indicate=TRUE
)
.join id from to covar_A covar_B start stop R.start R.stop
<int> <int> <IDat> <IDat> <char> <char> <IDat> <IDat> <IDat> <IDat>
1: 3 1 2025-01-05 2025-01-10 Pt int 1 Drug 1 2025-01-01 2025-01-27 2025-01-05 2025-01-10
2: 3 1 2025-01-25 2025-01-27 Pt int 1 Drug 2 2025-01-01 2025-01-27 2025-01-25 2025-01-30
3: 3 1 2025-02-11 2025-02-17 Pt int 2 Drug 3 2025-02-11 2025-02-17 2025-02-04 2025-03-01
4: 3 1 2025-02-21 2025-03-01 Pt int 3 Drug 3 2025-02-21 2025-03-01 2025-02-04 2025-03-01
5: 3 1 2025-03-06 2025-03-11 Pt int 4 Drug 4 2025-03-06 2025-03-13 2025-03-06 2025-03-11
Other answers
The other answers all also give the same correct output on the toy data (@Wimpel's doesn't include the empty spells (which they could easily add), and defines contiguous spells like 1-4,5-7).
None of them incorporate multiple patients, but a dirty method of doing this is to wrap them in by.
# 1000 histories of length 10
args <- list(n_i=1000, n_t=10)
idx_t <- rep(1:args$n_t, args$n_i); set.seed(1)
A <- do.call(gen_spells, args)[, covar_A := paste("Pt int", idx_t)][]
B <- do.call(gen_spells, args)[, covar_B := paste("Drug", idx_t)][]
for (x in list(A2 <- copy(A), B2 <- copy(B))) x[, names(.SD) := lapply(.SD, as.Date), .SDcols=c("start","stop")]
do_by_id <- function(f,x,y) x[, f(.SD,y[id==.BY,.SD,.SDcols=-"id"]), by=id]
> print(rbind(bm3a,bm3b), unit="seconds", signif=3)
Unit: seconds
expr min lq mean median uq max neval
union_fill(A, B) 0.0526 0.0536 0.0583 0.0565 0.0604 0.0724 10
union_join(A, B) 0.0307 0.0335 0.0350 0.0356 0.0370 0.0376 10
do_by_id(marcguery, A2, B2) 12.2000 13.1000 13.4000 13.5000 13.8000 14.3000 10
do_by_id(Wimpel, A, B) 10.1000 11.2000 11.6000 11.8000 12.2000 12.2000 10
do_by_id(r2evans, A, B) 112.0000 112.0000 112.0000 112.0000 112.0000 112.0000 1
do_by_id(deschen, A, B) 168.0000 168.0000 168.0000 168.0000 168.0000 168.0000 1
Please note well that the timings of the other solutions would be far faster if they were specifically adapted to cope with multiple patients, rather than being bunged into a general split-apply-combine setup. But even doing it this way, the worst case is ~30 mins on the full-scale data (10K such histories), so all of them are feasible on the task at hand, which is not time-sensitive (@deschen's point).
> all.equal(do_by_id(deschen,A,B), union_join(A,B),
use.names=FALSE, check.attributes=FALSE)
TRUE
Overlap joins
Part of your question refers to data.table's foverlaps() function. This was intended for asymmetric data typical in genomics. However, in my experience/playing around (which is not genomics) it is very often outperformed by an inequality join using standard data.table join machinery. (Note that inequality joins in data.table came after foverlaps().)
?foverlaps explains how the different overlap types translate to inequalities. It takes seeing, but A and B overlap iff start_A <= stop_B & stop_A >= start_B (where starting and stopping on the same day counts as overlapping, hence the weak inequalities). I find it easier to see this as !(start_A > stop_B | start_B > stop_A), i.e. they don't not overlap. But that doesn't work as a join predicate in data.table!
I've used a data.table inequality join in intersections(), which is several times faster than foverlaps(), though they are both basically instantaneous:
# intersections - overlap join step only (full-scale data)
data.table_inequality <- \(A,B) fjoin_inner(A,B,on=c("id","start <= stop","stop >= start"))
> print(bm_foverlaps, unit="seconds",signif=3)
Unit: seconds
expr min lq mean median uq max neval
data.table_inequality(A, B) 0.0209 0.0221 0.0265 0.0228 0.0234 0.193 100
foverlaps(A, B) 0.0561 0.0586 0.0762 0.0631 0.0787 0.254 100
foverlaps() requires keyed data and as a result it is very short and sweet - you typically don't need to specify the join columns. But there are times when you can't use it, e.g. because you need a strict inequality. That's actually the case in union_join(). In that case you have to patch things up post hoc (as in @marcguery's code), or write a data.table inequality join directly, which is notoriously painful because of its garbling of join column outputs (which is logical and systematic, but inconvenient). fjoin removes the pain because it autogenerates the data.table code for you. You can view this code instead of executing it, with do=FALSE. Going back to intersections() to illustrate:
> fjoin::fjoin_inner(
+ A, B, on=c("id", "start<=stop", "stop>=start"),
+ select=c("covar_A","covar_B"),
+ do=FALSE
+ )
.DT : y = B
.i : x = A
Join: setDT(.DT[.i, on = c("id", "stop >= start", "start <= stop"), nomatch = NULL,
data.frame(id = i.id, start = i.start, stop = i.stop, R.start = x.start, R.stop = x.stop,
covar_A, covar_B)])[]
(I've also added a select arg here in case you have irrelevant columns in your real-life A and B.) You could now swipe that data.table code and edit the j-expression to make the whole intersections solution a true one-liner, partly "ghostwritten" by fjoin:
setDT(
B[A,
on = c("id", "stop >= start", "start <= stop"),
nomatch = NULL,
data.frame(
id = i.id,
from = pmax(i.start, x.start), to = pmax(i.stop, x.stop),
covar_A, covar_B,
start_A = i.start, stop_A = i.stop,
start_B = x.start, stop_B = x.stop
)]
)[]
You might notice that fjoin avoids j=list(). This is because that deep-copies the vectors, which is unnecessary when they are fresh outputs of a join. But enough about fjoin, except to say that this is its first outing, and if you do choose to use it, you will be its patient zero!
(Code for artificial data)
gen_spells <- function(
n_i=1e4L, # patients
n_t=10L, # spells per patient
off_max=60L, # max gap between spells
on_max=180L, # max duration of spells
offset=0L, # offset from base
base=as.IDate("2020-01-01")
) {
dt <- data.table(
id=rep(1:n_i,each=n_t),
off=sample(off_max, n_i*n_t, TRUE),
on=sample(on_max, n_i*n_t, TRUE)
)
dt[, stop:=base-1L+cumsum(off+on), keyby=id][, start:=stop-on]
kcols <- c("id","start","stop")
setcolorder(dt[, c("off", "on") := NULL], kcols)
setkeyv(dt, kcols)[]
}
left_joinwith bothstartandstopas "by" variables. For the union case, can you please elaborate a bit more, e.g.2023-01-12does not appear as astartdate in any of your data sets, yet you show it in your expected output.b = sort(unique(c(A$start, A$stop, B$start, B$stop))); U = data.table(start=head(b, -1), stop=tail(b, -1)); U[A, on=.(start < stop, stop > start), covar_A := i.covar_A, by=.EACHI][B, on=.(start < stop, stop > start), covar_B := i.covar_B, by=.EACHI][]? And intersection is just filtering for non-NA?