I'm preparing to run Random Forest, but first I need to get my dataset into a form the model can use, so I'm trying to fix a few problems with it. I have solutions in theory, but I don't know how to execute them in code:
- There are NA gaps in certain columns that I need to fill.
Since the data is a time series of half-hourly values spanning over a year, I plan to compute the average for each day and fill the NAs with the average for the corresponding day. (I realize this isn't the most statistically sound method, but I just need code that runs at the moment.)
- Not all of the variables are numeric (some are dates, some are character)
I plan to remove the character columns en masse (though I don't know what code would accomplish this). I need to keep the date columns, however, and I'm not sure how to preserve them while taking averages of every other column.
- The dataset is very large (~200 variables, ~17,000 observations)
I'm hoping for code that averages all of these columns at once, so I don't have to go through every column manually.
I have found existing questions about averaging entire columns indiscriminately (rather than grouped by day) and about keeping only the numeric variables (but I need to keep the dates as well). Unfortunately, neither approach works here.
Below is example data (TIMESTAMP and dates are read as characters here, whereas they are date classes in my actual dataframe; a conversion to match follows the sample):
TIMESTAMP <- c("2019-04-27 17:30:00", "2019-04-27 18:00:00", "2019-04-27 18:30:00", "2019-04-27 19:00:00", "2019-04-27 19:30:00", "2019-04-28 10:00:00", "2019-04-28 10:30:00", "2019-04-28 11:00:00", "2019-04-28 11:30:00", "2019-04-28 12:00:00")
dates<-c("2019-04-27", "2019-04-27", "2019-04-27", "2019-04-27", "2019-04-27", "2019-04-28", "2019-04-28", "2019-04-28", "2019-04-28", "2019-04-28")
ch4_flux <- c(NA, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16,NA, 61.54,61.53)
distance <- c(1000,1000,NA,125.35,1000,NA,1000,5.50,NA,1000)
Tau <-c(0.0322000, 0.0495000, 0.1737616, 0.1772567, NA, 0.1246816, 0.1435230, 0.1098670, NA, NA)
filename<- c("2019-04-27T173000_AIU-2079.ghg","2019-04-27T180000_AIU-2079.ghg", "2019-04-27T183000_AIU-2079.ghg","2019-04-27T190000_AIU-2079.ghg", "2019-04-27T193000_AIU-2079.ghg",NA, "2019-04-28T100000_AIU-2079.ghg","2019-04-28T103000_AIU-2079.ghg", "2019-04-28T110000_AIU-2079.ghg",NA)
dd<- data.frame(TIMESTAMP, dates, ch4_flux, distance,Tau, filename)
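To make the sample columns match the classes in my real dataframe, they can be converted like this (a minimal base R sketch; the format string is taken from the sample values, and tz = "UTC" matches the tzone shown in the dput() output below):
# store the character columns as the date classes used in the real data
dd$TIMESTAMP <- as.POSIXct(dd$TIMESTAMP, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
dd$dates <- as.Date(dd$dates)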
The head of the original raw dataset is below:
> dput(AMPdates2[1:10, 1:8])
structure(list(TIMESTAMP = structure(c(1556386200, 1556388000,
1556389800, 1556391600, 1556393400, 1556395200, 1556397000, 1556398800,
1556400600, 1556402400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Time_period_check = c(30, 30, 30, 30, 30, 30, 30, 30, 30,
30), filename = c("2019-04-27T170000_AIU-2079.ghg", "2019-04-27T173000_AIU-2079.ghg",
"2019-04-27T180000_AIU-2079.ghg", "2019-04-27T183000_AIU-2079.ghg",
"2019-04-27T190000_AIU-2079.ghg", "2019-04-27T193000_AIU-2079.ghg",
"2019-04-27T200000_AIU-2079.ghg", "2019-04-27T203000_AIU-2079.ghg",
"2019-04-27T210000_AIU-2079.ghg", "2019-04-27T213000_AIU-2079.ghg"
), date = structure(c(1556323200, 1556323200, 1556323200,
1556323200, 1556323200, 1556323200, 1556323200, 1556323200,
1556323200, 1556323200), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
time_adjusted = c("4/27/2019 17:30", "4/27/2019 18:00", "4/27/2019 18:30",
"4/27/2019 19:00", "4/27/2019 19:30", "4/27/2019 20:00",
"4/27/2019 20:30", "4/27/2019 21:00", "4/27/2019 21:30",
"4/27/2019 22:00"), time = c("17:40", "18:10", "18:40", "19:10",
"19:40", "20:10", "20:40", "21:10", "21:40", "22:10"), DOY = c(117.735,
117.756, 117.777, 117.798, 117.819, 117.84, 117.86, 117.881,
117.902, 117.923), daytime = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0)), row.names = c(NA, 10L), class = "data.frame")
2 Answers
library(dplyr)

dd %>%
  # grouping key: the calendar day extracted from the timestamp
  mutate(DAY_grp = as.Date(TIMESTAMP)) %>%
  # within each day, replace NAs in every numeric column with that day's mean
  mutate(across(where(is.numeric), zoo::na.aggregate),
         .by = DAY_grp) %>%
  # drop the helper column
  select(-DAY_grp)
#> # A tibble: 10 × 6
#> TIMESTAMP dates ch4_flux distance Tau filename
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 2019-04-27 17:30:00 2019-04-27 64.9 1000 0.0322 2019-04-27T173000_AI…
#> 2 2019-04-27 18:00:00 2019-04-27 66.4 1000 0.0495 2019-04-27T180000_AI…
#> 3 2019-04-27 18:30:00 2019-04-27 65.4 781. 0.174 2019-04-27T183000_AI…
#> 4 2019-04-27 19:00:00 2019-04-27 64.4 125. 0.177 2019-04-27T190000_AI…
#> 5 2019-04-27 19:30:00 2019-04-27 63.5 1000 0.108 2019-04-27T193000_AI…
#> 6 2019-04-28 10:00:00 2019-04-28 62.8 668. 0.125 <NA>
#> 7 2019-04-28 10:30:00 2019-04-28 62.2 1000 0.144 2019-04-28T100000_AI…
#> 8 2019-04-28 11:00:00 2019-04-28 62.0 5.5 0.110 2019-04-28T103000_AI…
#> 9 2019-04-28 11:30:00 2019-04-28 61.5 668. 0.126 2019-04-28T110000_AI…
#> 10 2019-04-28 12:00:00 2019-04-28 61.5 1000 0.126 <NA>
Created on 2025-03-18 with reprex v2.1.1
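To confirm the fill reached every numeric column at once, a quick check (assuming the pipeline's result is saved to out, a name introduced here for illustration):
out <- dd %>%
  mutate(DAY_grp = as.Date(TIMESTAMP)) %>%
  mutate(across(where(is.numeric), zoo::na.aggregate), .by = DAY_grp) %>%
  select(-DAY_grp)
# remaining NAs per numeric column; a non-zero count flags a day in which
# every observation of that column was missing
colSums(is.na(out[sapply(out, is.numeric)]))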
Here is a fast base R approach based on ave():
dd[sapply(dd, is.numeric)] <- lapply(Filter(is.numeric, dd), \(x) {
  i <- is.na(x)
  # ave() returns each day's mean (NAs stripped), aligned with the original rows
  x[i] <- ave(x, dd$dates, FUN = \(x) mean(x, na.rm = TRUE))[i]
  x
})
> dd
TIMESTAMP dates ch4_flux distance Tau filename
1 2019-04-27 17:30:00 2019-04-27 64.9275 1000.0000 0.0322000 2019-04-27T173000_AIU-2079.ghg
2 2019-04-27 18:00:00 2019-04-27 66.3900 1000.0000 0.0495000 2019-04-27T180000_AIU-2079.ghg
3 2019-04-27 18:30:00 2019-04-27 65.3900 781.3375 0.1737616 2019-04-27T183000_AIU-2079.ghg
4 2019-04-27 19:00:00 2019-04-27 64.4100 125.3500 0.1772567 2019-04-27T190000_AIU-2079.ghg
5 2019-04-27 19:30:00 2019-04-27 63.5200 1000.0000 0.1081796 2019-04-27T193000_AIU-2079.ghg
6 2019-04-28 10:00:00 2019-04-28 62.7600 668.5000 0.1246816 <NA>
7 2019-04-28 10:30:00 2019-04-28 62.1600 1000.0000 0.1435230 2019-04-28T100000_AIU-2079.ghg
8 2019-04-28 11:00:00 2019-04-28 61.9975 5.5000 0.1098670 2019-04-28T103000_AIU-2079.ghg
9 2019-04-28 11:30:00 2019-04-28 61.5400 668.5000 0.1260239 2019-04-28T110000_AIU-2079.ghg
10 2019-04-28 12:00:00 2019-04-28 61.5300 1000.0000 0.1260239 <NA>
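To then drop the character columns en masse while keeping the date columns for the Random Forest step, one more line suffices (a sketch, assuming TIMESTAMP and dates are stored as POSIXct/Date rather than character, as in the dput() above):
# keep every column that is not character; Date/POSIXct columns are retained
dd_rf <- dd[!sapply(dd, is.character)]
str(dd_rf)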