Tidyup 8 - Expanding the filter() family #30
jennybc left a comment
I like the proposal! Made a few comments as I reacted to a first reading.
|
tidyups/008-dplyr-filter-family.md Line 948 in 5b76b43 FWIW, as a user I would much prefer the name Love the idea for this API btw! |
|
@wurli most of us felt that We also really appreciated how it feels like a "variant" of With |
|
With |
|
Love this implementation, I do think the |
|
Awesome proposal! My 2 cents - I think filter/filter_out is slightly unclear naming. I think filter_keep/filter_drop would be better with filter deprecated |
|
@davidhodge931 as stated in the tidyup at https://github.com/tidyverse/tidyups/blob/feature/008/008-dplyr-filter-family.md#alternate-names-for-filter, we are not considering renaming |
|
Love the idea! How do you teach that
# Sequence with filter()
. |>
filter(x) |>
filter(y)
# Same as conjunction
. |>
filter(x, y)
# Sequence with filter_out()
. |>
filter_out(x) |>
filter_out(y)
# Same as alternation (!?!)
. |>
  filter_out(x | y)
|
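The equivalence being asked about can be checked on a small grid of `TRUE`/`FALSE`/`NA` values. The sketch below uses base R only. `keep_rows()` and `drop_rows()` are hypothetical single-condition stand-ins for `filter()` and `filter_out()`, written under the assumption (taken from this thread) that a row is kept, or dropped, only where the condition is `TRUE`. They are not dplyr's implementation.

```r
# Hypothetical stand-ins (not dplyr code). Assumption: both verbs act only
# on rows where the condition is TRUE; NA counts as "not TRUE".
keep_rows <- function(df, cond) {
  lgl <- eval(substitute(cond), df, parent.frame())
  df[which(lgl), , drop = FALSE]
}
drop_rows <- function(df, cond) {
  lgl <- eval(substitute(cond), df, parent.frame())
  df[!(lgl %in% TRUE), , drop = FALSE]
}

# Every combination of TRUE / FALSE / NA for two conditions
df <- data.frame(
  id = 1:9,
  x = rep(c(TRUE, FALSE, NA), times = 3),
  y = rep(c(TRUE, FALSE, NA), each = 3)
)

# A sequence of keeps matches one keep combined with `&`
keep_seq <- keep_rows(keep_rows(df, x), y)
keep_and <- keep_rows(df, x & y)

# A sequence of drops matches one drop combined with `|`
drop_seq <- drop_rows(drop_rows(df, x), y)
drop_or <- drop_rows(df, x | y)

identical(keep_seq$id, keep_and$id)
identical(drop_seq$id, drop_or$id)
```

Under these assumed semantics both comparisons hold, which is the asymmetry the question points at: chaining keeps behaves like `&`, while chaining drops behaves like `|`.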
|
I think the best way to teach this is probably something like:
# Combining with `&`
df |> filter(x, y)
df |> filter_out(x, y)
# Combining with `|`
df |> filter(when_any(x, y))
df |> filter_out(when_any(x, y))
I think the fact that |
|
To me, the antisymmetry is not only theoretically pleasing. I'm reading
I'd never read it like:
Even stronger with
. |>
  filter_out(
    x,
    y
  )
To me, the |
|
Completely agree with @krlmlr here. I think this is a critical function of the api that makes learning the syntax much easier, especially for beginners. I would expect |
|
There are two competing worldviews at play here.
Both of these have their pros and cons. My theory is that the first of these is the most practically useful for dplyr users and is the easiest to learn.
As complements
If both
df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))
df |> filter_out(x, y)
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))
# ---
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))
df |> filter(x | y)
df |> filter(when_any(x, y))
Notice how everything above the line related to
I'd argue that an extremely important property of this table is that you only have to learn 1 rule - that
As a nice side effect this means you only need to worry about
This all means that if you are translating from a
patients <- tibble::tibble(
name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)
patients
With years of
patients |>
  filter(!(deceased & date < 2012))
But immediately get frustrated when it drops your
patients |>
  filter_out(deceased & date < 2012)
And boom that works as expected. And since there is only 1 rule that applies for both
patients |>
  filter_out(deceased, date < 2012)
You also get this nice result, i.e. they are complements of one another
# Equivalent up to row ordering
union(filter(df, x, y), filter_out(df, x, y)) ~= df
It is true that you can't break
df |> filter(x, y)
df |> filter(x & y)
df |> filter(x) |> filter(y)
df |> filter_out(x | y)
df |> filter_out(x) |> filter_out(y)
But I'd argue that was never a goal to begin with, and is not how I would teach them. For example, if I'm looking for "rows where
df |> filter(cyl == 5, disp > 20)
and it would not occur to me to write this, even though they are equivalent
df |> filter(cyl == 5) |> filter(disp > 20)
In other words, my problem statement of "rows where
This also means that I don't find Kirill's idea that
I think a more appropriate goal of
As chainable equivalents
If
df |> filter(x, y)
df |> filter(x & y)
df |> filter(when_all(x, y))
df |> filter_out(x, y)
df |> filter_out(x | y)
df |> filter_out(when_any(x, y))
# ---
df |> filter(x | y)
df |> filter(when_any(x, y))
df |> filter_out(x & y)
df |> filter_out(when_all(x, y))
My argument is that this is actually much harder for people to learn.
And this is on top of having to think about
But most importantly, you can no longer easily translate a
patients |>
  filter(!(deceased & date < 2012))
then you have to translate to this
patients |>
  filter_out(when_all(deceased, date < 2012))
and I'd argue that is an increase in mental burden to translate to over the "just drop the
In my ideal world both
This approach does have this "chainable equivalence" property that has been discussed, but I'd again argue that this is not a design goal, and is not the way I'd encourage teaching
df |> filter(x, y)
df |> filter(x) |> filter(y)
df |> filter_out(x, y)
df |> filter_out(x) |> filter_out(y)
So why do
|
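The `patients` translation discussed above can be traced by hand with base R. This is a minimal sketch, assuming the semantics described in the thread: `filter()` keeps rows where the condition is `TRUE`, and `filter_out()` drops rows where it is `TRUE` while keeping `FALSE` and `NA`. The variable names are illustrative only.

```r
# Data copied from the `patients` example above, as a base R data frame
patients <- data.frame(
  name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
  deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
  date = c(2005, 2010, NA, 2020, 2010, NA, NA),
  stringsAsFactors = FALSE
)

cond <- patients$deceased & patients$date < 2012

# Assumed `filter(!cond)` behaviour: negation leaves NA as NA, and rows
# with an NA condition are not kept, so they disappear as well.
negated_keep <- patients$name[which(!cond)]

# Assumed `filter_out(cond)` behaviour: only rows where the condition is
# TRUE are dropped; FALSE and NA rows stay.
dropped_out <- patients$name[!(cond %in% TRUE)]

# Complement check: rows kept by the condition plus rows left after
# dropping on the same condition give back every patient.
kept <- patients$name[which(cond)]

negated_keep
dropped_out
setequal(c(kept, dropped_out), patients$name)
```

Only Mark satisfies the condition, so the drop-style version removes one row, while the negated keep also loses the three rows whose condition evaluates to `NA`.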
|
In ordinary English, when we talk about removing things, “X and Y” is almost always understood as “anything that is X or Y,” i.e. a union of categories to exclude, not a logical “and” inside a single condition. Examples:
And without “filter” language at all:
In all these cases “X and Y” is just a list of things to get rid of: “get rid of X, and also get rid of Y,” which is logically “X or Y” on the exclusion side. If you also think of |
|
@t-kalinowski that is a good example to think about, but I do not think it is as compelling as you think it is because it is the same for keeping things. Examples:
So IMO this cannot be used as an argument for "filter out combines with These examples are all somewhat interesting because they only involve a single variable, and the way they have been written actually translates to a single
Something about the way the English You'd have to say it like this to mean intersection, and written this way it kind of implies you have separate
I also do not think we agree on what a complement means?
I don't think so? If you're filling in a Venn diagram of sets and you start with
then that is where the A and B circles overlap. To get the complement, shade every part of the diagram except where A and B overlap, and that gives you
If you stare at that for a bit, you see that an equivalent way to say this is to drop the part where A and B overlap, which is:
I think the confusion can come in if you try to perform the complement in your head at the same time that you switch verbs from "keep" to "drop". This confused me numerous times while writing the tidyup until I wrote the Venn diagram down on paper. Regardless, that means that "drop where either A or B are TRUE" is definitely not the complement of "keep where both A and B are TRUE". |
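The Venn diagram argument can also be written out as a truth table. The sketch below uses base R on two-valued logic only (no `NA`), and the variable names are illustrative.

```r
# All four combinations of two TRUE/FALSE conditions
grid <- expand.grid(A = c(TRUE, FALSE), B = c(TRUE, FALSE))

# "Keep where both A and B are TRUE"
keep_both <- grid$A & grid$B

# Its complement: everything outside the overlap of A and B
complement <- !keep_both

# Rows that remain after "drop where either A or B is TRUE"
after_drop_either <- !(grid$A | grid$B)

# De Morgan: the complement of (A & B) is (!A | !B) ...
identical(complement, !grid$A | !grid$B)

# ... which is not what is left after dropping rows where A or B is TRUE
identical(complement, after_drop_either)

sum(complement)        # rows in the complement
sum(after_drop_either) # rows left after dropping on A | B
```

The complement covers three of the four combinations, while dropping on `A | B` leaves only the one where both are `FALSE`.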
|
Allllll of this discussion really leads us back to a single key point: For both
That's it. That's the whole confusion right there. I think we have in our heads that
Here's an "ideal world" thought. Take a page from the Zen of Python and adopt the rule of "In the face of ambiguity, refuse the temptation to guess". Remove the ambiguity altogether by doing what Stata does and what Kirill said before - limit to only 1 expression.
retain(data, when, ..., by = NULL)
exclude(data, when, ..., by = NULL)
when_all(...)
when_any(...)
if_all(cols, fn)
if_any(cols, fn)
I think everyone wins here.
So you can write things like
cars |> retain(class == "suv" & mpg < 15)
cars |> retain(when_all(
class == "suv",
mpg < 15
))
cars |> retain(class == "suv" | mpg < 15)
cars |> retain(when_any(
class == "suv",
mpg < 15
))
cars |> exclude(class == "suv" & mpg < 15)
cars |> exclude(when_all(
class == "suv",
mpg < 15
))
cars |> exclude(class == "suv" | mpg < 15)
cars |> exclude(when_any(
class == "suv",
mpg < 15
))
And still use
cars |> exclude(if_any(c(x, y, z), is.na))
I think this is...beautiful? It has a very nice symmetry to it, and all of the ambiguity we've been confused over has disappeared. The main issue with it is that introducing a new name for |
|
Thanks @DavisVaughan! You convinced me that |
|
At this point I really see two options
|
|
After the further discussion, I don't think there is a viable way to create
I think one unintuitive result of combining
Maybe this is just me, but my intuition would tell me the first call removes more rows but the opposite behavior is true. IMO @DavisVaughan's proposal for
I think the net benefit of solving both of these issues is worth the cost of potentially superseding
I guess I don't see a compelling reason why |
That doesn't feel unintuitive to me. You are tightening the bounds on what to drop. Importantly, it works the same way as
As mentioned in #30 (comment) (sent at roughly the same time as your message), I came to the opposite conclusion
countries |>
filter(
(name %in% c("US", "CA") & between(score, 200, 300)) |
(name %in% c("PR", "RU") & between(score, 100, 200)) |
(name %in% c("JP", "CH") & between(score, 400, 600))
)
# VS
countries |>
filter(when_any(
name %in% c("US", "CA") & between(score, 200, 300),
name %in% c("PR", "RU") & between(score, 100, 200),
name %in% c("JP", "CH") & between(score, 400, 600)
))
They are also faster than repeated
They also provide a useful
Outside of the context of
I'd write your example like this
df |>
exclude(when_all(
x == 0 | is.na(x),
y > 5
))
df |>
exclude(when_all(
x %in% c(0, NA),
y > 5
  ))
I don't think there is anything wrong with using |
|
I see appeal in the
Does this still hold in the presence of missing values in the predicates? If not, can we create a similar invariant using
I had to look up and manually test the semantics of
It looks like any solution that we come up with here will be a tradeoff. Realistically, the only way a larger user base can play with it is by sending an experimental version to CRAN. I assume the new functions will be tagged "experimental", with some opportunity to adapt as needed? |
|
If you'd like to try these yourself, I've pushed a WIP to
pak::pak("tidyverse/dplyr@feature/filter-out-2")
Just so we are all talking about the same thing, here's the output table for
library(dplyr)
df <- tibble(
x = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA),
y = c(TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA)
)
df |>
mutate(
any_propagate = when_any(x, y, na_rm = FALSE),
any_remove = when_any(x, y, na_rm = TRUE),
all_propagate = when_all(x, y, na_rm = FALSE),
all_remove = when_all(x, y, na_rm = TRUE)
)
#> # A tibble: 9 × 6
#>   x     y     any_propagate any_remove all_propagate all_remove
#>   <lgl> <lgl> <lgl>         <lgl>      <lgl>         <lgl>
#> 1 TRUE  TRUE  TRUE          TRUE       TRUE          TRUE
#> 2 TRUE  FALSE TRUE          TRUE       FALSE         FALSE
#> 3 TRUE  NA    TRUE          TRUE       NA            TRUE
#> 4 FALSE TRUE  TRUE          TRUE       FALSE         FALSE
#> 5 FALSE FALSE FALSE         FALSE      FALSE         FALSE
#> 6 FALSE NA    NA            FALSE      FALSE         FALSE
#> 7 NA    TRUE  TRUE          TRUE       NA            TRUE
#> 8 NA    FALSE NA            FALSE      FALSE         FALSE
#> 9 NA    NA    NA            FALSE      NA            TRUE
Regarding option renaming, we already have
I'm assuming you are questioning whether the union claim holds in The important part of
Yea, totally
Yea, we will mark it as experimental for a release or so. Note that we have already gotten lots of great feedback about this idea via bluesky and linkedin. |
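For readers without the WIP branch installed, the four columns of the `when_any()` / `when_all()` table above can be reproduced with base R. This is only an emulation consistent with the printed output, not the implementation on the branch: the "propagate" columns are plain `|` and `&`, and the "remove" columns treat a missing value as the identity of the operation (`FALSE` for any, `TRUE` for all).

```r
x <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA)
y <- c(TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA)

# Missing values propagate, exactly as with `|` and `&`
any_propagate <- x | y
all_propagate <- x & y

# Missing values removed:
# for "any", an NA cannot contribute a TRUE, so treat it as FALSE
any_remove <- (x %in% TRUE) | (y %in% TRUE)
# for "all", an NA cannot contribute a FALSE, so treat it as TRUE
all_remove <- !(x %in% FALSE) & !(y %in% FALSE)

data.frame(x, y, any_propagate, any_remove, all_propagate, all_remove)
```

The last row is the one worth noting: with both inputs missing, the "remove" variants fall back to `FALSE` for any and `TRUE` for all, mirroring `any(logical())` and `all(logical())`.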
|
Here's another example I think will be quite common. Interactively I will probably want to confirm that I'm about to drop the rows I think I'm going to drop, so interactively I'll do
df |> filter(these_rows, those_rows)
then I'll stare at the output and make sure those are the rows I want to drop. Once I'm happy with that, all I currently have to do is change to
df |> filter_out(these_rows, those_rows)
and boom now those rows are dropped. That's pretty nice! That doesn't hold if we combine with
(I'm trying to document examples like these here, as they will eventually make their way into the tidyup under a new section of something like |
|
I was surprised to see, with the PR:
pkgload::load_all()
#> ℹ Loading dplyr
df <- tidyr::expand_grid(a = c(TRUE, FALSE, NA), b = c(TRUE, FALSE, NA))
df |>
  filter(a, b)
#> # A tibble: 1 × 2
#>   a     b
#>   <lgl> <lgl>
#> 1 TRUE  TRUE
df |>
  filter_out(a, b)
#> # A tibble: 8 × 2
#>   a     b
#>   <lgl> <lgl>
#> 1 TRUE  FALSE
#> 2 TRUE  NA
#> 3 FALSE TRUE
#> 4 FALSE FALSE
#> 5 FALSE NA
#> 6 NA    TRUE
#> 7 NA    FALSE
#> 8 NA    NA
Created on 2025-11-26 with reprex v2.1.1
This means that
It makes much more sense now, thanks for your patience! |
# dplyr 1.2.1

* dplyr is now fully compliant with the R C API (#7819).

# dplyr 1.2.0

## New features

* New `filter_out()` companion to `filter()`.

  * Use `filter()` when specifying rows to _keep_.
  * Use `filter_out()` when specifying rows to _drop_.

  `filter_out()` simplifies cases where you would have previously used a `filter()` to drop rows. It is particularly useful when missing values are involved. For example, to drop rows where the `count` is zero:

  ```r
  df |> filter(count != 0 | is.na(count))

  df |> filter_out(count == 0)
  ```

  With `filter()`, you must provide a "negative" condition of `!= 0` and must explicitly guard against accidentally dropping rows with `NA`. With `filter_out()`, you directly specify rows to drop and you don't have to guard against dropping rows with `NA`, which tends to result in much clearer code.

  This work is a result of [Tidyup 8: Expanding the `filter()` family](tidyverse/tidyups#30), with a lot of great feedback from the community (#6560, #6891).

* New `when_any()` and `when_all()`, which are elementwise versions of `any()` and `all()`. Alternatively, you can think of them as performing repeated `|` and `&` on any number of inputs, for example:

  * `when_any(x, y, z)` is equivalent to `x | y | z`.
  * `when_all(x, y, z)` is equivalent to `x & y & z`.

  `when_any()` is particularly useful within `filter()` and `filter_out()` to specify comma separated conditions combined with `|` rather than `&`, like:

  ```r
  # With `|`
  countries |>
    filter(
      (name %in% c("US", "CA") & between(score, 200, 300)) |
        (name %in% c("PR", "RU") & between(score, 100, 200))
    )

  # With `when_any()`, you drop the explicit `|`, the extra `()`, and your
  # conditions are all indented to the same level
  countries |>
    filter(when_any(
      name %in% c("US", "CA") & between(score, 200, 300),
      name %in% c("PR", "RU") & between(score, 100, 200)
    ))

  # To drop these rows instead, use `filter_out()`
  countries |>
    filter_out(when_any(
      name %in% c("US", "CA") & between(score, 200, 300),
      name %in% c("PR", "RU") & between(score, 100, 200)
    ))
  ```

  This work is a result of [Tidyup 8: Expanding the `filter()` family](tidyverse/tidyups#30).

* `case_when()` is now part of a family of 4 related functions, 3 of which are new:

  * Use `case_when()` to create a new vector based on logical conditions.
  * Use `replace_when()` to update an existing vector based on logical conditions.
  * Use `recode_values()` to create a new vector by mapping all old values to new values.
  * Use `replace_values()` to update an existing vector by mapping some old values to new values.

  Learn all about these in a new vignette, `vignette("recoding-replacing")`.

  `replace_when()` is particularly useful for conditionally mutating rows within one or more columns, and can be thought of as an enhanced version of `base::replace()`.

  `recode_values()` and `replace_values()` have the familiar `case_when()`-style formula interface for easy interactive use, but also have `from` and `to` arguments as a way for you to incorporate a pre-built lookup table, making them more holistic replacements for both `case_match()` and `recode()`.

  This work is a result of [Tidyup 7: Recoding and replacing values in the tidyverse](https://github.com/tidyverse/tidyups/blob/main/007-tidyverse-recoding-and-replacing.md), with a lot of great [feedback](tidyverse/tidyups#29) from the community (#7728, #7729).

* `case_when()` has gained a new `.unmatched` argument. For extra safety, set `.unmatched = "error"` rather than providing a `.default` when you believe that you've handled every possible case, and it will error if a case is left unhandled. The new `recode_values()` also has this argument (#7653).

* `if_else()`, `case_when()`, and `coalesce()` have gotten significantly faster and use much less memory due to a rewrite in C via vctrs (#7723, #7725, #7727).

* New `ptype` argument for `between()`, allowing users to specify the desired output type. This is particularly useful for ordered factors and other complex types where the default common type behavior might not be ideal (#6906, @JamesHWade).

* New `rbind()` method for `rowwise_df` to avoid creating corrupt rowwise data frames (r-lib/vctrs#1935).

## Lifecycle changes

### Newly stable

* `.by` has moved from experimental to stable (#7762).

* `reframe()` has moved from experimental to stable (#7713, @VisruthSK).

### Newly breaking

* `if_else()` no longer allows `condition` to be a logical array. It must be a logical vector with no `dim` attribute (#7723).

### Newly deprecated

* `case_match()` is soft-deprecated, and is fully replaced by `recode_values()` and `replace_values()`, which are more flexible, more powerful, and have much better names.

* In `case_when()`, supplying all size 1 LHS inputs along with a size >1 RHS input is now soft-deprecated. This is an improper usage of `case_when()` that should instead be a series of if statements, like:

  ```r
  # Scalars!
  code <- 1L
  flavor <- "vanilla"

  # Improper usage:
  case_when(
    code == 1L && flavor == "chocolate" ~ x,
    code == 1L && flavor == "vanilla" ~ y,
    code == 2L && flavor == "vanilla" ~ z,
    .default = default
  )

  # Recommended:
  if (code == 1L && flavor == "chocolate") {
    x
  } else if (code == 1L && flavor == "vanilla") {
    y
  } else if (code == 2L && flavor == "vanilla") {
    z
  } else {
    default
  }
  ```

  The recycling behavior that allows this style of `case_when()` to work is unsafe, and can result in silent bugs that we'd like to guard against with an error in the future (#7082).

* The `dplyr.legacy_locale` global option is soft-deprecated. If you used this to affect the ordering of `arrange()`, use `arrange(.locale =)` instead. If you used this to affect the ordering of `group_by() |> summarise()`, follow up with an additional call to `arrange(.locale =)` instead (#7760).

* Passing `size` to `if_else()` is now deprecated. The output size is always taken from the `condition` (#7722).

### Other deprecation advancements

* The following were already deprecated, and are now defunct and throw an error:

  * All underscored standard evaluation versions of major dplyr verbs. Deprecated in 0.7.0 (Jun 2017), use the non-underscored version of the verb with unquoting instead, see `vignette("programming")`. This includes:
    * `add_count_()`
    * `add_tally_()`
    * `arrange_()`
    * `count_()`
    * `distinct_()`
    * `do_()`
    * `filter_()`
    * `funs_()`
    * `group_by_()`
    * `group_indices_()`
    * `mutate_()`
    * `tally_()`
    * `transmute_()`
    * `rename_()`
    * `select_()`
    * `slice_()`
    * `summarise_()`
    * `summarize_()`

  * `mutate_each()`, `mutate_each_()`, `summarise_each()`, and `summarise_each_()`. Deprecated in 0.7.0 (Jun 2017), use `across()` instead.

  * Returning more or less than 1 row per group in `summarise()`. Deprecated in 1.1.0 (Jan 2023), use `reframe()` instead.

  * `combine()`. Deprecated in 1.0.0 (May 2020), use `c()` or `vctrs::vec_c()` instead.

  * `src_mysql()`, `src_postgres()`, `src_sqlite()`, `src_local()`, and `src_df()`. Deprecated in 1.0.0 (May 2020), use `tbl()` instead.

  * `tbl_df()` and `as.tbl()`. Deprecated in 1.0.0 (May 2020), use `tibble::as_tibble()` instead.

  * `add_rownames()`. Deprecated in 1.0.0 (May 2020), use `tibble::rownames_to_column()` instead.

  * The `.drop` argument of `add_count()`. Deprecated in 1.0.0 (May 2020), had no effect.

  * The `add` argument of `group_by()` and `group_by_prepare()`. Deprecated in 1.0.0 (May 2020), use `.add` instead.

  * The `.dots` argument of `group_by()` and `group_by_prepare()`. Deprecated in 1.0.0 (May 2020).

  * The `...` argument of `group_keys()` and `group_indices()`. Deprecated in 1.0.0 (May 2020), use `group_by()` first.

  * The `keep` argument of `group_map()`, `group_modify()`, and `group_split()`. Deprecated in 1.0.0 (May 2020), use `.keep` instead.

  * Using `across()` and data frames in `filter()`. Deprecated in 1.0.8 (Feb 2022), use `if_any()` or `if_all()` instead.

  * `multiple = NULL` in joins. Deprecated in 1.1.1 (Mar 2023), use `multiple = "all"` instead.

  * `multiple = "error" / "warning"` in joins. Deprecated in 1.1.1 (Mar 2023), use `relationship = "many-to-one"` instead.

  * The `vars` argument of `group_cols()`. Deprecated in 1.0.0 (Jan 2023).

* The following were already deprecated, and now warn unconditionally if used:

  * `all_equal()`. Deprecated in 1.1.0 (Jan 2023), use `all.equal()` instead.

  * `progress_estimated()`. Deprecated in 1.0.0 (May 2020).

  * `filter()` with a 1 column matrix. Deprecated in 1.1.0 (Jan 2023), use a vector instead.

  * `slice()` with a 1 column matrix. Deprecated in 1.1.0 (Jan 2023), use a vector instead.

  * Not supplying the `.cols` argument of `across()`. Deprecated in 1.1.0 (Jan 2023).

  * `group_indices()` with no arguments. Deprecated in 1.0.0 (May 2020), use `cur_group_id()` instead.

* The following were already soft-deprecated, and now warn once per session if used:

  * `cur_data()` and `cur_data_all()`. Deprecated in 1.1.0 (Jan 2023), use `pick()` instead.

  * The `...` argument of `across()`. Deprecated in 1.1.0 (Jan 2023), use an anonymous function instead.

  * Using `by = character()` to perform a cross join. Deprecated in 1.1.0 (Jan 2023), use `cross_join()` instead.

### Removed

The following were already defunct, and have been removed:

* `id()`. Deprecated in 0.5.0 (Jun 2016), use `vctrs::vec_group_id()` instead. If your package uses NSE and implicitly relied on the variable `id` being available, you now need to put `utils::globalVariables("id")` inside one of your package files to tell R that `id` is a column name.

* `failwith()`. Deprecated in 0.7.0 (Jun 2017), use `purrr::possibly()` instead.

* `select_vars()` and `select_vars_()`. Deprecated in 0.8.4 (Jan 2020), use `tidyselect::vars_select()` instead.

* `rename_vars()` and `rename_vars_()`. Deprecated in 0.8.4 (Jan 2020), use `tidyselect::vars_rename()` instead.

* `select_var()`. Deprecated in 0.8.4 (Jan 2020), use `tidyselect::vars_pull()` instead.

* `current_vars()`. Deprecated in 0.8.4 (Jan 2020), use `tidyselect::peek_vars()` instead.

* `bench_tbls()`, `compare_tbls()`, `compare_tbls2()`, `eval_tbls()`, and `eval_tbls2()`. Deprecated in 1.0.0 (May 2020).

* `location()` and `changes()`. Deprecated in 1.0.0 (May 2020), use `lobstr::ref()` instead.

## Minor improvements and bug fixes

* The base pipe is now used throughout the documentation (#7711).

* The superseded `recode()` now has updated documentation showing how to migrate to `recode_values()` and `replace_values()`.

* The `.groups` message emitted by `summarise()` is hopefully more clear now (#6986).

* `storms` has been updated to include 2023 and 2024 data (#7111, @tomalrussell).

* `if_any()` and `if_all()` are now more consistent in all use cases (#7059, #7077, #7746, @jrwinget). In particular:

  * When called with zero inputs, `if_any()` returns `FALSE` and `if_all()` returns `TRUE`.

  * When called with one input, both now return logical vectors rather than the original column.

  * The result of applying `.fns` now must be a logical vector.

* `tally_n()` creates fully qualified function calls for duckplyr compatibility (#7046).

* Empty `rowwise()` list-column elements now resolve to `logical()` rather than a random logical of length 1 (#7710).

* `last_dplyr_warnings()` no longer prevents objects from being garbage collected (#7649).

* `case_when()` now throws correctly indexed errors when `NULL`s are supplied in `...` (#7739).

* `case_when()` now throws a better error if one of the conditions is an array (#6862, @ilovemane).

* `bind_rows()` now replaces empty (or `NA`) element names in a list with its numeric index while preserving existing names (#7719, @Meghansaha).

* New `slice_sample()` example showing how to use it to shuffle rows (#7707, @Hzanib).

* Updated `across()` examples to include an example using `everything()` (#7621, @JBrandenburg02).

* Clarified how `slice_min()` and `slice_max()` work in the introduction vignette (#7717, @ccani007).

* Fixed an edge case when coercing data frames to matrices (#7004).

* Fixed an issue where duckplyr's ALTREP data frames were being materialized early due to internal usage of `ncol()` (#7049).

* Progress towards making dplyr conformant with the public C API of R (#7741, #7797).

* R >=4.1.0 is now required, in line with the [tidyverse standard](https://tidyverse.org/blog/2019/04/r-version-support/) of supporting the previous 5 minor releases of R (#7711).
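The `count` example from the dplyr 1.2.0 release notes above can be traced in base R. This is a sketch with made-up data, assuming `filter()` keeps rows where the condition is `TRUE` and `filter_out()` drops rows where it is `TRUE`. It does not call dplyr.

```r
# Made-up data with a missing count
df <- data.frame(count = c(0, 3, NA, 0, 7))

# Emulates `filter(count != 0)`: the NA row is lost along with the zeros
naive <- df$count[which(df$count != 0)]

# Emulates `filter(count != 0 | is.na(count))`: explicit NA guard
guarded <- df$count[which(df$count != 0 | is.na(df$count))]

# Emulates `filter_out(count == 0)`: drop only where the test is TRUE
is_zero <- df$count == 0
dropped <- df$count[!(is_zero %in% TRUE)]

naive
guarded
dropped
identical(guarded, dropped)
```

Under these assumptions the guarded keep and the direct drop return the same values, which is the simplification the release note describes.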
Readable link
Most relevant issues
filter(.missing = ) option to optionally retain missing values dplyr#6560
filter(.missing = NULL, .how = c("keep", "drop")) dplyr#6891
We are open to feedback until Monday, November 24th.