I have a large data frame with many variables measured at three time points: t1, t2 and t3. I only want to impute missing values for time points at which the participant responded at all, that is, where the corresponding indicator (answered_t1, answered_t2 or answered_t3) equals 1. I tried to specify this with the "where" argument of mice(), but the completed data still contain missing values in cells that should have been imputed. Either I am doing something wrong, or the mice algorithm is behaving unexpectedly. Here is a simplified example:
# Load necessary library
library(mice)
# Set seed for reproducibility
set.seed(42)
# Number of participants
n <- 100
# Answered indicators (no missing data)
answered_t1 <- rbinom(n, 1, 0.8)
answered_t2 <- rbinom(n, 1, 0.7)
answered_t3 <- rbinom(n, 1, 0.6)
# Create function to generate variable with ~20% missing data
generate_var <- function(answered) {
  var <- rnorm(n)
  var[!answered] <- NA # Set entire time point to NA if not answered
  missing_idx <- sample(which(answered == 1), size = round(0.2 * sum(answered)))
  var[missing_idx] <- NA
  return(var)
}
# Generate variables according to answered indicators
TU1_t1 <- generate_var(answered_t1)
TU1_t2 <- generate_var(answered_t2)
TU1_t3 <- generate_var(answered_t3)
TU2_t1 <- generate_var(answered_t1)
TU2_t2 <- generate_var(answered_t2)
TU2_t3 <- generate_var(answered_t3)
# Create the data frame
df <- data.frame(
  TU1_t1, TU1_t2, TU1_t3,
  TU2_t1, TU2_t2, TU2_t3,
  answered_t1, answered_t2, answered_t3
)
# Create the predictor matrix and specify that "answered" variables are not used as predictors
pred <- make.predictorMatrix(df)
pred[, grep("answered", colnames(pred))] <- 0
# Create a "where" matrix to specify where imputation should occur
where <- is.na(df)
# Only allow imputation where the corresponding "answered" variable is 1
where[, "TU1_t1"] <- where[, "TU1_t1"] & df$answered_t1 == 1
where[, "TU1_t2"] <- where[, "TU1_t2"] & df$answered_t2 == 1
where[, "TU1_t3"] <- where[, "TU1_t3"] & df$answered_t3 == 1
where[, "TU2_t1"] <- where[, "TU2_t1"] & df$answered_t1 == 1
where[, "TU2_t2"] <- where[, "TU2_t2"] & df$answered_t2 == 1
where[, "TU2_t3"] <- where[, "TU2_t3"] & df$answered_t3 == 1
# Perform multiple imputation using mice with the "where" matrix
imp <- mice(df, m = 5, predictorMatrix = pred, where = where, printFlag = FALSE)
# Check the completed data for a case where there is a missing value that should be imputed but isn't
completed_data <- complete(imp)
print(completed_data[3, ])
where[3, ]
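To see how widespread the problem is rather than inspecting a single row, a small check along these lines could count the cells that were flagged in "where" but are still NA after completion. This is only a sketch: `count_unimputed` is a hypothetical helper, not part of mice, and the toy data below are made up so that the block runs on its own without mice.

```r
# Hypothetical helper (not part of mice): count, per column, the cells that
# were flagged for imputation in `where` but are still NA in the completed data.
count_unimputed <- function(completed, where) {
  still_missing <- is.na(as.matrix(completed)) & where
  colSums(still_missing)
}

# Toy check that does not need mice: the flagged cell in column x is still NA,
# while the NA in column y was never flagged for imputation.
toy_completed <- data.frame(x = c(1, NA, 3), y = c(NA, 2, 3))
toy_where <- matrix(
  c(FALSE, TRUE, FALSE,   # column x
    FALSE, FALSE, FALSE), # column y
  nrow = 3, dimnames = list(NULL, c("x", "y"))
)
toy_result <- count_unimputed(toy_completed, toy_where)
print(toy_result)
```

With the objects from the example above, the same helper would be called as count_unimputed(complete(imp), where); any non-zero count marks a column with cells that should have been imputed but were not.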
Does anyone know why these values are not imputed?