I'm preparing to run Random Forest, but first I need to get my dataset into a form the model can use, so I'm trying to fix a few problems with it. I have solutions in theory, but I don't know how to execute them in code:
- There are NA gaps in certain columns that I need to fill.
Since the data is a time series of half-hourly values spanning over a year, I plan to compute the average for each day and fill the NAs with the average for the corresponding day. (I realize this isn't the most statistically sound method, but I just need code that runs at the moment.)
- Not all of the variables are numeric (some are dates, some are character)
I plan to remove the character columns en masse (though I don't know what code would accomplish this). I need to keep the date columns, however, and I'm not sure how to preserve them while taking averages of every other column.
- The dataset is very large (~200 variables, ~17,000 observations)
I'm hoping for code that averages all of these columns at once, so I don't have to go through every column manually.
I have found existing questions about averaging entire columns indiscriminately (rather than grouped by day) and about keeping only the numeric variables (but I need to keep the dates as well). Unfortunately, neither approach works here.
Below is example data (TIMESTAMP and dates are read as characters here, whereas they are date classes in my actual dataframe; a conversion to match follows the sample):
TIMESTAMP <- c("2019-04-27 17:30:00", "2019-04-27 18:00:00", "2019-04-27 18:30:00", "2019-04-27 19:00:00", "2019-04-27 19:30:00", "2019-04-28 10:00:00", "2019-04-28 10:30:00", "2019-04-28 11:00:00", "2019-04-28 11:30:00", "2019-04-28 12:00:00")
dates<-c("2019-04-27", "2019-04-27", "2019-04-27", "2019-04-27", "2019-04-27", "2019-04-28", "2019-04-28", "2019-04-28", "2019-04-28", "2019-04-28")
ch4_flux <- c(NA, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16,NA, 61.54,61.53)
distance <- c(1000,1000,NA,125.35,1000,NA,1000,5.50,NA,1000)
Tau <-c(0.0322000, 0.0495000, 0.1737616, 0.1772567, NA, 0.1246816, 0.1435230, 0.1098670, NA, NA)
filename<- c("2019-04-27T173000_AIU-2079.ghg","2019-04-27T180000_AIU-2079.ghg", "2019-04-27T183000_AIU-2079.ghg","2019-04-27T190000_AIU-2079.ghg", "2019-04-27T193000_AIU-2079.ghg",NA, "2019-04-28T100000_AIU-2079.ghg","2019-04-28T103000_AIU-2079.ghg", "2019-04-28T110000_AIU-2079.ghg",NA)
dd<- data.frame(TIMESTAMP, dates, ch4_flux, distance,Tau, filename)
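To make the sample columns match the classes in my real dataframe, they can be converted like this (a minimal base R sketch; the format string is taken from the sample values, and tz = "UTC" matches the tzone shown in the dput() output below):
# store the character columns as the date classes used in the real data
dd$TIMESTAMP <- as.POSIXct(dd$TIMESTAMP, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
dd$dates <- as.Date(dd$dates)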
The head of the original raw dataset is below:
> dput(AMPdates2[1:10, 1:8])
structure(list(TIMESTAMP = structure(c(1556386200, 1556388000,
1556389800, 1556391600, 1556393400, 1556395200, 1556397000, 1556398800,
1556400600, 1556402400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Time_period_check = c(30, 30, 30, 30, 30, 30, 30, 30, 30,
30), filename = c("2019-04-27T170000_AIU-2079.ghg", "2019-04-27T173000_AIU-2079.ghg",
"2019-04-27T180000_AIU-2079.ghg", "2019-04-27T183000_AIU-2079.ghg",
"2019-04-27T190000_AIU-2079.ghg", "2019-04-27T193000_AIU-2079.ghg",
"2019-04-27T200000_AIU-2079.ghg", "2019-04-27T203000_AIU-2079.ghg",
"2019-04-27T210000_AIU-2079.ghg", "2019-04-27T213000_AIU-2079.ghg"
), date = structure(c(1556323200, 1556323200, 1556323200,
1556323200, 1556323200, 1556323200, 1556323200, 1556323200,
1556323200, 1556323200), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
time_adjusted = c("4/27/2019 17:30", "4/27/2019 18:00", "4/27/2019 18:30",
"4/27/2019 19:00", "4/27/2019 19:30", "4/27/2019 20:00",
"4/27/2019 20:30", "4/27/2019 21:00", "4/27/2019 21:30",
"4/27/2019 22:00"), time = c("17:40", "18:10", "18:40", "19:10",
"19:40", "20:10", "20:40", "21:10", "21:40", "22:10"), DOY = c(117.735,
117.756, 117.777, 117.798, 117.819, 117.84, 117.86, 117.881,
117.902, 117.923), daytime = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0)), row.names = c(NA, 10L), class = "data.frame")
2 Answers
library(dplyr)

dd %>%
  # grouping key: the calendar day extracted from the timestamp
  mutate(DAY_grp = as.Date(TIMESTAMP)) %>%
  # within each day, replace NAs in every numeric column with that day's mean
  mutate(across(where(is.numeric), zoo::na.aggregate),
         .by = DAY_grp) %>%
  # drop the helper column
  select(-DAY_grp)
#> # A tibble: 10 × 6
#> TIMESTAMP dates ch4_flux distance Tau filename
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 2019-04-27 17:30:00 2019-04-27 64.9 1000 0.0322 2019-04-27T173000_AI…
#> 2 2019-04-27 18:00:00 2019-04-27 66.4 1000 0.0495 2019-04-27T180000_AI…
#> 3 2019-04-27 18:30:00 2019-04-27 65.4 781. 0.174 2019-04-27T183000_AI…
#> 4 2019-04-27 19:00:00 2019-04-27 64.4 125. 0.177 2019-04-27T190000_AI…
#> 5 2019-04-27 19:30:00 2019-04-27 63.5 1000 0.108 2019-04-27T193000_AI…
#> 6 2019-04-28 10:00:00 2019-04-28 62.8 668. 0.125 <NA>
#> 7 2019-04-28 10:30:00 2019-04-28 62.2 1000 0.144 2019-04-28T100000_AI…
#> 8 2019-04-28 11:00:00 2019-04-28 62.0 5.5 0.110 2019-04-28T103000_AI…
#> 9 2019-04-28 11:30:00 2019-04-28 61.5 668. 0.126 2019-04-28T110000_AI…
#> 10 2019-04-28 12:00:00 2019-04-28 61.5 1000 0.126 <NA>
Created on 2025-03-18 with reprex v2.1.1
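To confirm the fill reached every numeric column at once, a quick check (assuming the pipeline's result is saved to out, a name introduced here for illustration):
out <- dd %>%
  mutate(DAY_grp = as.Date(TIMESTAMP)) %>%
  mutate(across(where(is.numeric), zoo::na.aggregate), .by = DAY_grp) %>%
  select(-DAY_grp)
# remaining NAs per numeric column; a non-zero count flags a day in which
# every observation of that column was missing
colSums(is.na(out[sapply(out, is.numeric)]))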
Here is a fast base R approach based on ave():
dd[sapply(dd, is.numeric)] <- lapply(Filter(is.numeric, dd), \(x) {
  i <- is.na(x)
  # ave() returns each day's mean (NAs stripped), aligned with the original rows
  x[i] <- ave(x, dd$dates, FUN = \(x) mean(x, na.rm = TRUE))[i]
  x
})
> dd
TIMESTAMP dates ch4_flux distance Tau filename
1 2019-04-27 17:30:00 2019-04-27 64.9275 1000.0000 0.0322000 2019-04-27T173000_AIU-2079.ghg
2 2019-04-27 18:00:00 2019-04-27 66.3900 1000.0000 0.0495000 2019-04-27T180000_AIU-2079.ghg
3 2019-04-27 18:30:00 2019-04-27 65.3900 781.3375 0.1737616 2019-04-27T183000_AIU-2079.ghg
4 2019-04-27 19:00:00 2019-04-27 64.4100 125.3500 0.1772567 2019-04-27T190000_AIU-2079.ghg
5 2019-04-27 19:30:00 2019-04-27 63.5200 1000.0000 0.1081796 2019-04-27T193000_AIU-2079.ghg
6 2019-04-28 10:00:00 2019-04-28 62.7600 668.5000 0.1246816 <NA>
7 2019-04-28 10:30:00 2019-04-28 62.1600 1000.0000 0.1435230 2019-04-28T100000_AIU-2079.ghg
8 2019-04-28 11:00:00 2019-04-28 61.9975 5.5000 0.1098670 2019-04-28T103000_AIU-2079.ghg
9 2019-04-28 11:30:00 2019-04-28 61.5400 668.5000 0.1260239 2019-04-28T110000_AIU-2079.ghg
10 2019-04-28 12:00:00 2019-04-28 61.5300 1000.0000 0.1260239 <NA>
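To then drop the character columns en masse while keeping the date columns for the Random Forest step, one more line suffices (a sketch, assuming TIMESTAMP and dates are stored as POSIXct/Date rather than character, as in the dput() above):
# keep every column that is not character; Date/POSIXct columns are retained
dd_rf <- dd[!sapply(dd, is.character)]
str(dd_rf)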