Reading in a data file with staggered column names into R - Stack Overflow

I have a text file with the following format:

* +---------------- Station Code
* |    +----------- Schedule Arrival Day
* |    |  +-------- Schedule Arrival Time
* |    |  |     +----- Schedule Departure Day
* |    |  |     |  +-- Schedule Departure Time
* |    |  |     |  |     +------------- Actual Arrival Time
* |    |  |     |  |     |     +------- Actual Departure Time
* |    |  |     |  |     |     |     +- Comments
* V    V  V     V  V     V     V     V
* NOL  *  *     1  900A  *     900A  Departed:  On time.
* SCH  *  *     1  1030A *     1039A Departed:  9 minutes late.
* NIB  *  *     1  1156A *     1159A Departed:  3 minutes late.
* LFT  *  *     1  1224P *     1228P Departed:  4 minutes late.
* LCH  *  *     1  155P  *     155P  Departed:  On time.

How do I read this into R with the several different column names on different rows?

asked Mar 10 at 19:22 by mjc0203
  • One option. Use read.csv with sep = " " and skip = 9 (adjust that to fit your actual use case). Afterwards, wrangle the column names manually and concatenate columns. Even better, change the upstream workflow to give you an easier life in the future. – Limey Commented Mar 10 at 19:32
  • Maybe we can extract the column names from the rows until the line of Vs comes, then remove all "-"/"*" to get the col_names. Then we find the positions of the Vs and use them for read.fwf(v_positions), since there does not seem to be a discernible delimiter, unless maybe "", but this messes up the last column. @mjc0203 are there many files in this format, or is it only this one? If it's only one, then do what Limey says. – Tim G Commented Mar 10 at 19:51
  • This is an example from this site: dixielandsoftware/Amtrak/status/StatusPages/index.html. They are, unfortunately, all like this :( – mjc0203 Commented Mar 10 at 22:53
  • Ok but when I go to the site and then click on the blue arrow (formatted data) it's transformed to a proper table. I can even write you a function to webscrape this. Then you don't even have to do the downloading as a text file and transforming. – Tim G Commented Mar 10 at 23:33
  • @TimG I was working from the downloaded zipped files that have the unformatted data for each year. – mjc0203 Commented Mar 11 at 16:52

2 Answers

This is not an entirely easy task, but we might do as @Tim G suggests like so:

# 1. prep work
## read data as lines of text
data_file <- "C:/Users/you/somewhere/data.txt"
data_lines <- readLines(data_file)
## identify which line starts with "* V"
v_line_index <- which(grepl("^\\* V", data_lines))

# 2. generate vector of column names
## from line 1 until the line before the v_line_index
data_col_name_lines <- data_lines[1:(v_line_index - 1)]
## remove any special characters to retain column names
col_names <- gsub("[\\* \\+\\|-]", "", data_col_name_lines)
## and prepend an initial column name
col_names <- c("star", col_names)

# 3. determine column widths:
## split the v_line_index line into individual characters
v_line_index_chars <- strsplit(data_lines[v_line_index], "")[[1]]
## determine which columns have either a "*" or "V"
v_line_v_indexes <- which(v_line_index_chars %in% c("*", "V"))
## calculate column widths by taking differences and appending a long last width
widths <- c(diff(v_line_v_indexes), 100)

# 4. finally read the data with `read.fwf()`
## skip every line up to and including the "* V" line rather than hard-coding 9
read.fwf(
  data_file,
  widths = widths,
  header = FALSE,
  skip = v_line_index,
  col.names = col_names
)
#>   star StationCode ScheduleArrivalDay ScheduleArrivalTime ScheduleDepartureDay
#> 1   *        NOL                  *                *                         1
#> 2   *        SCH                  *                *                         1
#> 3   *        NIB                  *                *                         1
#> 4   *        LFT                  *                *                         1
#> 5   *        LCH                  *                *                         1
#>   ScheduleDepartureTime ActualArrivalTime ActualDepartureTime
#> 1                900A              *                   900A  
#> 2                1030A             *                   1039A 
#> 3                1156A             *                   1159A 
#> 4                1224P             *                   1228P 
#> 5                155P              *                   155P  
#>                     Comments
#> 1        Departed:  On time.
#> 2 Departed:  9 minutes late.
#> 3 Departed:  3 minutes late.
#> 4 Departed:  4 minutes late.
#> 5        Departed:  On time.

Created on 2025-03-10 with reprex v2.1.1
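
Since the asker mentions that all of the downloaded files share this layout, the steps above could be wrapped in a function. The following is only a sketch: `read_staggered()` is a hypothetical helper name, and it assumes the lines passed in contain nothing but the "+---" header lines, the "* V" marker line, and data rows (no trailing unstructured text).

```r
# Hypothetical helper (not part of the answer above): parse one file's lines.
# Assumes `lines` holds only "+---" header lines, the "* V" line, and data rows.
read_staggered <- function(lines) {
  v_idx <- which(grepl("^\\* V", lines))[1]
  if (is.na(v_idx)) stop("no '* V' marker line found")

  # column names: keep only the "+---" lines, then strip the drawing characters
  header <- lines[seq_len(v_idx - 1)]
  header <- header[grepl("\\+-+", header)]
  col_names <- c("star", gsub("[*+| -]", "", header))

  # column starts are the positions of "*" and "V" on the marker line
  starts <- which(strsplit(lines[v_idx], "")[[1]] %in% c("*", "V"))
  widths <- c(diff(starts), 100)  # 100 is an arbitrary "wide enough" last width

  # read the data rows from memory; close the text connection when done
  con <- textConnection(lines[-seq_len(v_idx)])
  on.exit(close(con))
  read.fwf(con, widths = widths, header = FALSE,
           col.names = col_names, strip.white = TRUE)
}

# usage with the first two data rows from the question
sample_lines <- c(
  "* +---------------- Station Code",
  "* |    +----------- Schedule Arrival Day",
  "* |    |  +-------- Schedule Arrival Time",
  "* |    |  |     +----- Schedule Departure Day",
  "* |    |  |     |  +-- Schedule Departure Time",
  "* |    |  |     |  |     +------------- Actual Arrival Time",
  "* |    |  |     |  |     |     +------- Actual Departure Time",
  "* |    |  |     |  |     |     |     +- Comments",
  "* V    V  V     V  V     V     V     V",
  "* NOL  *  *     1  900A  *     900A  Departed:  On time.",
  "* SCH  *  *     1  1030A *     1039A Departed:  9 minutes late."
)
res <- read_staggered(sample_lines)
```

For a folder of such files, something like `lapply(files, function(f) read_staggered(readLines(f)))` should work, where `files` is whatever vector of paths you have.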

I know you've already accepted an answer, but here's an alternative (since it was a fun challenge). The benefit of this approach is that it does not read the whole file into R's memory with readLines, and it reads the file only once.

  • Read line by line, discarding lines until we reach the first +--- line
  • Keep collecting the column-name lines in hdrs, line by line, until we see a line of Vs
  • Use the line of Vs to determine the widths for a call to read.fwf (fixed-width format)
  • Use the saved hdrs to define the column names, after discarding the first column (which is an unlabeled *)
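con <- file("~/Downloads/86.txt", "r")
# skip the preamble; the loop exits with the first "+---" line in txt
while (!grepl(" \\+-+", txt <- readLines(con, n=1))) {}
# that line names the first column, so it starts the header vector
hdrs <- txt
# keep collecting column-name lines until the line of Vs
while (!grepl("\\bV +V\\b", txt <- readLines(con, n=1))) hdrs <- c(hdrs, txt)
# the V positions mark where each column starts
Vs <- which(strsplit(txt, "")[[1]] == "V")
widths <- c(diff(c(1, Vs)), 999)
# read the rest of the open connection, dropping the unlabeled "*" column
dat <- read.fwf(con, widths = widths, header = FALSE)[,-1] |>
  setNames(trimws(sub(".*- ", "", hdrs)))
close(con)
dat[] <- lapply(dat, trimws)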

head(dat)
#   Station Code Schedule Arrival Day Schedule Arrival Time Schedule Departure Day Schedule Departure Time
# 1          RVM                    *                     *                      1                    535A
# 2          RVR                    *                     *                      1                    605A
# 3          ASD                    *                     *                      1                    619A
# 4          FBG                    *                     *                      1                    702A
# 5          QAN                    *                     *                      1                    722A
# 6          WDB                    *                     *                      1                    734A
#   Actual Arrival Time Actual Departure Time                   Comments
# 1                   *                  535A        Departed:  On time.
# 2                   *                  605A        Departed:  On time.
# 3                   *                  619A        Departed:  On time.
# 4                   *                  703A  Departed:  1 minute late.
# 5                   *                  725A Departed:  3 minutes late.
# 6                   *                  741A Departed:  7 minutes late.
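
The column names in the output above contain spaces, so they need backticks whenever they are used with `$` or in formulas. One possible way to tidy them is sketched here; `clean_names()` is a hypothetical helper, not something from base R or from this answer, and it simply capitalises each word and glues the words together.

```r
# Hypothetical helper: turn "Schedule Arrival Day" into "ScheduleArrivalDay".
clean_names <- function(x) {
  # split each name on runs of non-alphanumeric characters
  words <- strsplit(trimws(x), "[^[:alnum:]]+")
  vapply(words, function(w) {
    w <- w[nzchar(w)]
    # upper-case the first letter of every word, then concatenate
    paste0(toupper(substring(w, 1, 1)), substring(w, 2), collapse = "")
  }, character(1))
}

cleaned <- clean_names(c("Station Code", "Schedule Arrival Day", "Comments"))
```

With the data frame from this answer, that would be applied as `names(dat) <- clean_names(names(dat))`.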

Ways to improve/change/ruggedize this:

  • 999 is a magic constant, do with it what you will

  • the bottom of the file I downloaded (https://dixielandsoftware/cgi-bin/gettrain.pl?seltrain=86&selyear=2025&selmonth=03&selday=10) went unstructured. In this case read.fwf stopped on the empty line before that part, but there might be "junk" rows if it doesn't stop, or if there is no blank line before the different structure

  • I'm assuming the first-column * is throw-away

  • Perhaps +--- is presumptuous to find the first column name?

  • As is often the case with "blind" while loops, if the file is very long and not structured this way at all, this will slog relatively slowly through the file line by line until it hits EOF, and then fail, because readLines() returns a zero-length vector at EOF and the while condition cannot be evaluated. Each step might include a small defensive check before continuing.

  • The regexes seem safe enough, though they could be hardened a little in case column names are maliciously formed.

  • R doesn't complain about the names we've assigned, but they aren't terribly R-like ... feel free to massage hdrs however you like, perhaps forming PascalCaseNames as in the other answer or some other unambiguous cleaning.

  • I did not assume it, but it's not hard to imagine adding lapply(type.convert, as.is=TRUE) to the |>-pipe so that ints become ints, etc. I don't think the sample data I saw screams this, but it's an option in general.

    dat[] <- lapply(dat, trimws) |> lapply(type.convert, as.is = TRUE)
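
As one way to add the defensive check mentioned in the list above, the two while loops could be replaced by a small helper that stops cleanly at EOF or after a line budget. This is only a sketch: `read_until()` and its `max_lines` argument are hypothetical, and the tiny in-memory input stands in for a real file connection.

```r
# Hypothetical helper: read one line at a time until `pattern` matches,
# failing with a clear message at EOF or after `max_lines` lines.
read_until <- function(con, pattern, max_lines = 1000L) {
  seen <- character(0)
  for (i in seq_len(max_lines)) {
    txt <- readLines(con, n = 1L)
    # readLines() returns a zero-length vector at end of file
    if (length(txt) == 0L) stop("reached end of file before matching: ", pattern)
    if (grepl(pattern, txt)) return(list(match = txt, before = seen))
    seen <- c(seen, txt)
  }
  stop("no match for ", pattern, " within ", max_lines, " lines")
}

# usage on a small in-memory stand-in for the file
con <- textConnection(c(
  "some preamble",
  "* +---- Station Code",
  "* |    +- Comments",
  "* V    V",
  "* NOL  Departed:  On time."
))
first <- read_until(con, " \\+-+")        # first column-name line
rest  <- read_until(con, "\\bV +V\\b")    # remaining names, then the V line
hdrs  <- c(first$match, rest$before)
close(con)
```

After the second call the connection is positioned at the first data row, so `read.fwf(con, ...)` could continue from there as in the code above (before closing the connection).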
    