I have a very large dataset that I need help processing conditionally
Here is a snippet:
structure(list(month = c(202407L, 202408L, 202409L, 202410L,
202411L, 202412L, 202501L), item = c("A02", "A02", "A02", "A02",
"A02", "A02", "A02"), estimate = c("35.79", "", "35.79",
"35.79", "", "", ""), cycle = c("M", "O", "O", "M", "O", "O",
"M"), rsp = c(1, 1, 1, 1, 0, 0, 0)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), groups = structure(list(
item = "A02", .rows = structure(list(1:7), ptype = integer(0), class =
c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))
What I would like to do is create a variable out
that behaves like so:
If
cycle
is M andestimate
is""
, thenout
is 1If
cycle
is M andestimate
is not""
, thenout
is 0If
cycle
is O andestimate
is""
, then, look back at most recent row wherecycle
is MIf, in that month,
estimate
is""
, thenout
= 1If
estimate
is not""
, thenout
= 0.
Here is what I tried:
group_by(item) |>
mutate(out = case_when(cycle == "M" & estimate != "" ~ 0,
cycle == "M"& estimate == "" ~ 1,
cycle == "O" & which.max(cycle == "M" & estimate == "") ~ 1,
.default = 0))
Desired Result--The column out
should look like so: c(0, 0, 0, 0, 0, 0, 1)
Note: the dataframe is grouped and that number of rows between each cycle==M
entry is not necessarily fixed.
Any help is much appreciated. I tried using case_when
because I figured dplyr
would be the easiest way to do this but other suggestions are more than welcome.
Thanks.
I have a very large dataset that I need help processing conditionally
Here is a snippet:
structure(list(month = c(202407L, 202408L, 202409L, 202410L,
202411L, 202412L, 202501L), item = c("A02", "A02", "A02", "A02",
"A02", "A02", "A02"), estimate = c("35.79", "", "35.79",
"35.79", "", "", ""), cycle = c("M", "O", "O", "M", "O", "O",
"M"), rsp = c(1, 1, 1, 1, 0, 0, 0)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), groups = structure(list(
item = "A02", .rows = structure(list(1:7), ptype = integer(0), class =
c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))
What I would like to do is create a variable out
that behaves like so:
If
cycle
is M andestimate
is""
, thenout
is 1If
cycle
is M andestimate
is not""
, thenout
is 0If
cycle
is O andestimate
is""
, then, look back at most recent row wherecycle
is MIf, in that month,
estimate
is""
, thenout
= 1If
estimate
is not""
, thenout
= 0.
Here is what I tried:
group_by(item) |>
mutate(out = case_when(cycle == "M" & estimate != "" ~ 0,
cycle == "M"& estimate == "" ~ 1,
cycle == "O" & which.max(cycle == "M" & estimate == "") ~ 1,
.default = 0))
Desired Result--The column out
should look like so: c(0, 0, 0, 0, 0, 0, 1)
Note: the dataframe is grouped and that number of rows between each cycle==M
entry is not necessarily fixed.
Any help is much appreciated. I tried using case_when
because I figured dplyr
would be the easiest way to do this but other suggestions are more than welcome.
Thanks.
Share Improve this question asked yesterday jvalentijvalenti 6401 gold badge10 silver badges32 bronze badges1 Answer
Reset to default 1Fill up the cycle == "M"
values first and then use tidyr::fill
to get the most recent values copied at cycle == "O"
.
library(dplyr)
df %>%
mutate(out = case_when(cycle == "M" & estimate == "" ~ 1,
cycle == "M" & estimate != "" ~ 0)) %>%
tidyr::fill(out)
# A tibble: 7 × 6
# Groups: item [1]
# month item estimate cycle rsp out
# <int> <chr> <chr> <chr> <dbl> <dbl>
#1 202407 A02 "35.79" M 1 0
#2 202408 A02 "" O 1 0
#3 202409 A02 "35.79" O 1 0
#4 202410 A02 "35.79" M 1 0
#5 202411 A02 "" O 0 0
#6 202412 A02 "" O 0 0
#7 202501 A02 "" M 0 1