I have a text file with the following format:
* +---------------- Station Code
* | +----------- Schedule Arrival Day
* | | +-------- Schedule Arrival Time
* | | | +----- Schedule Departure Day
* | | | | +-- Schedule Departure Time
* | | | | | +------------- Actual Arrival Time
* | | | | | | +------- Actual Departure Time
* | | | | | | | +- Comments
* V V V V V V V V
* NOL * * 1 900A * 900A Departed: On time.
* SCH * * 1 1030A * 1039A Departed: 9 minutes late.
* NIB * * 1 1156A * 1159A Departed: 3 minutes late.
* LFT * * 1 1224P * 1228P Departed: 4 minutes late.
* LCH * * 1 155P * 155P Departed: On time.
How do I read this into R with the several different column names on different rows?
Asked Mar 10 at 19:22 by mjc0203
2 Answers
This is not an entirely easy task, but we might do as @Tim G suggests like so:
# 1. prep work
## read data as lines of text
data_file <- "C:/Users/you/somewhere/data.txt"
data_lines <- readLines(data_file)
## identify which line starts with "* V"
v_line_index <- which(grepl("^\\* V", data_lines))
# 2. generate vector of column names
## from line 1 until the line before the v_line_index
data_col_name_lines <- data_lines[1:(v_line_index - 1)]
## remove any special characters to retain column names
col_names <- gsub("[\\* \\+\\|-]", "", data_col_name_lines)
## and prepend an initial column name
col_names <- c("star", col_names)
# 3. determine column widths:
## split the v_line_index line into individual characters
v_line_index_chars <- strsplit(data_lines[v_line_index], "")[[1]]
## determine which columns have either a "*" or "V"
v_line_v_indexes <- which(v_line_index_chars %in% c("*", "V"))
## calculate column widths by taking differences and appending a long last width
widths = c(diff(v_line_v_indexes), 100)
# 4. finally read the data with `read.fwf()`
read.fwf(
  data_file,
  widths = widths,
  header = FALSE,
  skip = v_line_index,  # skip the header block and the "V" line (9 lines here)
  col.names = col_names
)
#> star StationCode ScheduleArrivalDay ScheduleArrivalTime ScheduleDepartureDay
#> 1 * NOL * * 1
#> 2 * SCH * * 1
#> 3 * NIB * * 1
#> 4 * LFT * * 1
#> 5 * LCH * * 1
#> ScheduleDepartureTime ActualArrivalTime ActualDepartureTime
#> 1 900A * 900A
#> 2 1030A * 1039A
#> 3 1156A * 1159A
#> 4 1224P * 1228P
#> 5 155P * 155P
#> Comments
#> 1 Departed: On time.
#> 2 Departed: 9 minutes late.
#> 3 Departed: 3 minutes late.
#> 4 Departed: 4 minutes late.
#> 5 Departed: On time.
Created on 2025-03-10 with reprex v2.1.1
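To see how the width calculation in step 3 behaves, here is a minimal, self-contained sketch on a tiny hand-aligned sample. The sample lines and their spacing are made up for illustration (the real file's alignment may differ), and `strip.white = TRUE` is an optional extra that trims the padding from each field.

```r
# Hypothetical, hand-aligned sample; not taken from the real file
sample_lines <- c(
  "* +------ Station Code",
  "* |   +-- Departure Time",
  "* V   V",
  "* NOL 900A",
  "* SCH 1030A"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

lines <- readLines(tmp)
v_idx <- grep("^\\* V", lines)  # the marker line

# positions of "*" and "V" on the marker line are the column starts
marks <- which(strsplit(lines[v_idx], "")[[1]] %in% c("*", "V"))
widths <- c(diff(marks), 100)   # last column is open-ended

# strip the drawing characters and spaces from the header lines
col_names <- c("star", gsub("[*+| -]", "", lines[seq_len(v_idx - 1)]))

demo <- read.fwf(
  tmp,
  widths = widths,
  header = FALSE,
  skip = v_idx,
  col.names = col_names,
  strip.white = TRUE
)
demo
```

Here the marker line has `*` at position 1 and `V` at positions 3 and 7, so the widths come out as 2, 4 and 100.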
I know you've already accepted an answer, but here's an alternative (since it was a fun challenge). The benefit of this approach is that it does not read the whole file into R's memory with `readLines`, and it reads the file only once.
- Read line-by-line, discarding lines, until we find the first of the `+---` lines
- Keep collecting the column-name lines in `hdrs`, line-by-line, until we see a line of `V`s
- Use the line of `V`s to determine the widths for a call to `read.fwf` (fixed-width format)
- Use the saved `hdrs` to define the column names, after discarding the first column (which is an unlabeled `*`)
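con <- file("~/Downloads/86.txt", "r")
# skip any preamble until the first "+---" line, which holds the first column name
while (!grepl(" \\+-+", txt <- readLines(con, n=1))) {}
hdrs <- txt
# collect the remaining column-name lines until the line of Vs
while (!grepl("\\bV +V\\b", txt <- readLines(con, n=1))) hdrs <- c(hdrs, txt)
# column boundaries are the positions of the Vs; the last column is open-ended
Vs <- which(strsplit(txt, "")[[1]] == "V")
widths <- c(diff(c(1, Vs)), 999)
# read.fwf() continues from the current position of the open connection
dat <- read.fwf(con, widths = widths, header = FALSE)[,-1] |>
  setNames(trimws(sub(".*- ", "", hdrs)))
close(con)
dat[] <- lapply(dat, trimws)
head(dat)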
# Station Code Schedule Arrival Day Schedule Arrival Time Schedule Departure Day Schedule Departure Time
# 1 RVM * * 1 535A
# 2 RVR * * 1 605A
# 3 ASD * * 1 619A
# 4 FBG * * 1 702A
# 5 QAN * * 1 722A
# 6 WDB * * 1 734A
# Actual Arrival Time Actual Departure Time Comments
# 1 * 535A Departed: On time.
# 2 * 605A Departed: On time.
# 3 * 619A Departed: On time.
# 4 * 703A Departed: 1 minute late.
# 5 * 725A Departed: 3 minutes late.
# 6 * 741A Departed: 7 minutes late.
Ways to improve/change/ruggedize this:
- `999` is a magic constant, do with it what you will
- the bottom of the file I downloaded (https://dixielandsoftware/cgi-bin/gettrain.pl?seltrain=86&selyear=2025&selmonth=03&selday=10) went unstructured ... in this case, the `read.fwf` stopped on the empty line before it, there might be "junk" if it doesn't stop or if there is no blank line before the different structure
- I'm assuming the first-column `*` is throw-away
- Perhaps `+---` is presumptuous to find the first column name?
- As is often the case with "blind" `while` loops, if the file is very long and not at all structured this way, this will relatively-slowly slog through the file line by line until it hits EOF and then likely fail trying to read/parse a closed file connection, so each step might include a little defensive check before continuing.
- The regexes seem safe enough, perhaps they can be hardened a little in case column names are maliciously formed.
- R doesn't complain about the names we've assigned, but they aren't terribly R-like ... feel free to massage `hdrs` however you like, perhaps forming PascalCaseNames as in the other answer or some other unambiguous cleaning.
- I did not assume it, but it's not hard to imagine adding `lapply(type.convert, as.is=TRUE)` to the `|>`-pipe so that ints become ints, etc. I don't think the sample data I saw screams this, but it's an option in general.
  dat[] <- lapply(dat, trimws) |> lapply(type.convert, as.is = TRUE)
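A few of the points above (the defensive check in the `while` loops, R-friendlier names, and type conversion) could look roughly like this. `read_until()` and `squash_name()` are hypothetical helpers, not part of base R, and the sample file is invented so the sketch runs on its own.

```r
# Hypothetical helper: read lines from an open connection until one matches
# `pattern`; fail with a clear error at EOF or after `max_lines` lines.
read_until <- function(con, pattern, max_lines = 1000L) {
  seen <- character(0)
  for (i in seq_len(max_lines)) {
    txt <- readLines(con, n = 1L)
    if (length(txt) == 0L) stop("reached end of file before matching: ", pattern)
    if (grepl(pattern, txt)) return(list(match = txt, before = seen))
    seen <- c(seen, txt)
  }
  stop("no match for ", pattern, " within ", max_lines, " lines")
}

# Hypothetical cleaner: drops everything that is not a letter or digit,
# e.g. "Station Code" -> "StationCode" (assumes words are already capitalized)
squash_name <- function(x) gsub("[^[:alnum:]]+", "", x)

# Invented, hand-aligned sample so the sketch is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "some preamble",
  "* +------ Station Code",
  "* |   +-- Departure Day",
  "* V   V",
  "* NOL 1",
  "* SCH 2"
), tmp)

con <- file(tmp, "r")
first <- read_until(con, " \\+-+")       # first column-name line
rest <- read_until(con, "\\bV +V\\b")    # remaining names, then the V line
hdrs <- c(first$match, rest$before)
Vs <- which(strsplit(rest$match, "")[[1]] == "V")
widths <- c(diff(c(1, Vs)), 999)
dat <- read.fwf(con, widths = widths, header = FALSE)[, -1]
close(con)

names(dat) <- squash_name(sub(".*- ", "", hdrs))
dat[] <- lapply(dat, trimws) |> lapply(type.convert, as.is = TRUE)
str(dat)
```

The `max_lines` cap is an arbitrary guard against slogging through a long file that never matches; pick whatever suits your data.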
`read.csv` with `sep = " "` and `skip = 9` (adjust that to fit your actual use case). Afterwards, wrangle the column names manually and concatenate columns. Even better, change the upstream workflow to give you an easier life in the future. – Limey Commented Mar 10 at 19:32
`read.fwf(v_positions)` since there does not seem to be a discernible delimiter unless maybe "", but this messes up the last column. @mjc0203 are there many files in this format? Or is it only this one? If it's only one then do what Limey says. – Tim G Commented Mar 10 at 19:51