I have a file that sometimes has additional values. Most people have a single SSN, but some people have multiple SSNs and they are separated on multiple lines. What is the regex I need for this? My desired output is an R list of SSNs.
The source file was read in using read_lines() as object f:
dput(f)
c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766",
" 123456777", " 123456788", " 123456799",
" 123456700", "Address: 5 Green Lane", "", "", "------------ Phase 2 ------------",
"DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane",
"")
and appears in the text file as:
------------ Phase 1 ------------
DOB: 12/23/23
SSN: 123456766
123456777
123456788
123456799
123456700
Address: 5 Green Lane
------------ Phase 2 ------------
DOB: 12/33/23
SSN: 223456766
Address: 22 Blue Lane
My current regex is: "SSN:\s+\d{9}(\n\s\d{9})?" and I've tried various regex options like dotall and multiline without success.
Regarding output structure, my preference is for something as simple as possible, ideally a data.frame (I'm open to suggestions on best practices). I realize multiples are awkward in a data.frame, since they would have to extend across multiple columns; otherwise a list could handle multiples, or a single data.frame cell with comma-separated values.
Thanks
5 Answers
Perhaps use "SSN:" for a positive look-behind, and match all following digits and whitespace, which can then be squished and split in the next steps?
library(stringr)
txt <-
"------------ Phase 1 ------------
DOB: 12/23/23
SSN: 123456766
123456777
123456788
123456799
123456700
Address: 5 Green Lane
------------ Phase 2 ------------
DOB: 12/33/23
SSN: 223456766
Address: 22 Blue Lane
"
# in case of lines vector from readLines() / readr::read_lines(),
# first collapse to a single string:
# txt <- paste0(f, collapse = "\n")
# or use readr::read_file()
str_extract_all(txt, "(?<=SSN:)[\\s\\d]+", simplify = TRUE) |>
str_squish() |>
str_split(" ")
#> [[1]]
#> [1] "123456766" "123456777" "123456788" "123456799" "123456700"
#>
#> [[2]]
#> [1] "223456766"
Created on 2025-03-25 with reprex v2.1.1
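If you would rather end up in the data.frame shape the question mentions, here is a minimal sketch building on the result above (my addition, not part of the answer; it assumes txt and the stringr calls already shown):
# My sketch (not from the answer): wrap the extracted SSNs in a data.frame,
# either as a list-column or as a comma-separated cell.
ssns <- str_extract_all(txt, "(?<=SSN:)[\\s\\d]+")[[1]] |>
  str_squish() |>
  str_split(" ")
df <- data.frame(phase = seq_along(ssns))
df$SSN <- ssns                                      # list-column variant
df$SSN_csv <- sapply(ssns, paste, collapse = ", ")  # comma-separated variant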
Another approach uses read.dcf, since the file is essentially in Debian Control File layout (field: value lines with indented continuation lines, records separated by blank lines):
DF <- read.dcf(textConnection(grep("^--", f, value = TRUE, invert = TRUE))) |>
  as.data.frame()
DF$SSN <- strsplit(DF$SSN, "\n")
This gives a data frame where the SSN column is a list of character vectors, one character vector per row.
If you would prefer a single \n-separated string in each SSN component, omit the last line.
If you would prefer a comma-separated string for the SSNs, replace the last line with DF$SSN <- chartr("\n", ",", DF$SSN).
If you would prefer that each row be repeated for each SSN, replace the last line with tidyr::separate_longer_delim(DF, SSN, delim = "\n").
The only regexes used are "^--" in grep, to remove the lines beginning with minus signs, and the "\n" in strsplit. We could get rid of the first with an appropriate use of startsWith and the second by adding the fixed = TRUE argument to strsplit, but these regexes seem simple enough that we leave them as is.
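For reference, a sketch of the "one row per SSN" variant mentioned above (my addition; it assumes f from the question and a tidyr version that provides separate_longer_delim):
# My sketch (not from the answer): the "repeat each row per SSN" variant.
DF <- read.dcf(textConnection(grep("^--", f, value = TRUE, invert = TRUE))) |>
  as.data.frame()
long <- tidyr::separate_longer_delim(DF, SSN, delim = "\n")
long$SSN <- trimws(long$SSN)  # defensive; read.dcf normally trims continuation indentation already
long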
I'll throw my hat into the ring ...
If there's a chance that any of the other fields (not just SSN:) could have more than one value, then this method produces a consistent set of values. There are a few options:
Option 1: simplest, long-format:
library(dplyr)
library(tidyr) # separate_wider_delim, fill
# txt <- c(...)  # here txt is the character vector of lines, i.e. the object f from the question
L <- split(txt, cumsum(grepl("^-", txt)))
names(L) <- sapply(L, `[[`, 1)
out <- lapply(L, function(x) data.frame(z=x[-1])) |>
bind_rows(.id = "phase") |>
separate_wider_delim(z, delim = ":", names = c("x", "y"), too_few = "align_end") |>
fill(x) |>
mutate(y = trimws(y)) |>
filter(nzchar(y) & !is.na(y))
out
# # A tibble: 10 × 3
# phase x y
# <chr> <chr> <chr>
# 1 ------------ Phase 1 ------------ DOB 12/23/23
# 2 ------------ Phase 1 ------------ SSN 123456766
# 3 ------------ Phase 1 ------------ SSN 123456777
# 4 ------------ Phase 1 ------------ SSN 123456788
# 5 ------------ Phase 1 ------------ SSN 123456799
# 6 ------------ Phase 1 ------------ SSN 123456700
# 7 ------------ Phase 1 ------------ Address 5 Green Lane
# 8 ------------ Phase 2 ------------ DOB 12/33/23
# 9 ------------ Phase 2 ------------ SSN 223456766
# 10 ------------ Phase 2 ------------ Address 22 Blue Lane
Option 2: if you want exactly one row per phase/field combination, you can list-column the y column:
out <- summarize(out, .by = c(phase, x), across(everything(), list))
out
# # A tibble: 6 × 3
# phase x y
# <chr> <chr> <list>
# 1 ------------ Phase 1 ------------ DOB <chr [1]>
# 2 ------------ Phase 1 ------------ SSN <chr [5]>
# 3 ------------ Phase 1 ------------ Address <chr [1]>
# 4 ------------ Phase 2 ------------ DOB <chr [1]>
# 5 ------------ Phase 2 ------------ SSN <chr [1]>
# 6 ------------ Phase 2 ------------ Address <chr [1]>
where each value in my (unimpressively-named) y column can be length 1+:
out$y[2]
# [[1]]
# [1] "123456766" "123456777" "123456788" "123456799" "123456700"
Option 3: if the "fields" (DOB, SSN, etc.) are consistent enough and you want a wider format, one can pivot this:
pivot_wider(out, id_cols = phase, names_from = "x", values_from = "y")
# # A tibble: 2 × 4
# phase DOB SSN Address
# <chr> <list> <list> <list>
# 1 ------------ Phase 1 ------------ <chr [1]> <chr [5]> <chr [1]>
# 2 ------------ Phase 2 ------------ <chr [1]> <chr [1]> <chr [1]>
Caveat: options 2 and 3 are predicated on your tools (and your brain) being able to process the list-columns of a nested frame. Sometimes it's an "efficiency" or "tidy-models" kind of thing, other times it's purely personal preference.
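If the list-columns get in the way, a minimal sketch of flattening the option-2 frame back out (my addition, not from the answer):
# My sketch: tidyr::unnest() expands the list-column back to one row per value,
# recovering the long format of option 1.
library(tidyr)
unnest(out, cols = y)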
1. read.fwf
For the given format, you can use read.fwf
library(tidyverse)
read.fwf(textConnection(text), widths = c(8, 25)) %>%
filter(!grepl("^--", V1)) %>% na.omit() %>%
mutate(Phase = cumsum(grepl("DOB:", V1)),
V1 = gsub(":","",na_if(trimws(V1), ""))) %>% fill(V1)
V1 V2 Phase
1 DOB 12/23/23 1
2 SSN 123456766 1
3 SSN 123456777 1
4 SSN 123456788 1
5 SSN 123456799 1
6 SSN 123456700 1
7 Address 5 Green Lane 1
10 DOB 12/33/23 2
11 SSN 223456766 2
12 Address 22 Blue Lane 2
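If you then want one row per phase with the SSNs collected together, a follow-on sketch (my addition; it assumes the result of the pipeline above has been assigned to long, and a dplyr version that supports .by in summarise):
# My sketch (not from the answer): one row per phase, SSNs in a single cell.
long |>
  summarise(V2 = paste(V2, collapse = ", "), .by = c(Phase, V1)) |>
  pivot_wider(names_from = V1, values_from = V2)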
2. gsub
This question was originally about the SSN numbers, not about the whole data structure. Assuming you are only interested in the SSN numbers as a vector, you can use grep to find the rows that start with "SSN:" and "Address:", read the lines in between, and strip "SSN:" and any whitespace (\\s+) using gsub:
lines <- readLines(textConnection(text))
unlist(mapply(function(start, end) {
gsub("\\s+", "", gsub("SSN:", "", lines[start:(end - 1)]))
}, grep("SSN:", lines), grep("Address:", lines)))
[1] "123456766" "123456777" "123456788" "123456799" "123456700" "223456766"
Or, to get a list, remove the unlist():
mapply(function(start, end) {
gsub("\\s+", "", gsub("SSN:", "", lines[start:(end - 1)]))
}, grep("SSN:", lines), grep("Address:", lines))
[[1]]
[1] "123456766" "123456777" "123456788" "123456799"223456766 "123456700"
[[2]]
[1] "223456766"
Test data
text <- "------------ Phase 1 ------------
DOB: 12/23/23
SSN: 123456766
123456777
123456788
123456799
123456700
Address: 5 Green Lane
------------ Phase 2 ------------
DOB: 12/33/23
SSN: 223456766
Address: 22 Blue Lane "
I don't know if this would be helpful. However, this is a way you can squeeze all social security numbers into one element, starting with SSN: and followed by a comma-separated list, using regex.
REGEX PATTERN (Java 8 flavor):
(?<=\d{9})[\s\",]*(\d{9})
REPLACEMENT STRING:
, $1
Regex Demo: https://regex101.com/r/JB79X9/9
TEST STRING (Added Phase 3):
c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766",
" 123456777", " 123456788", " 123456799",
" 123456700", "Address: 5 Green Lane", "", "", "------------ Phase 2 ------------",
"DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane", "------------ Phase 3 ------------", "DOB: 01/23/01", "SSN: 123456777",
" 125556777", " 123455555", " 123456700", "Address: 5 Green Lane", "", "",
"")
RESULT:
c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766, 123456777, 123456788, 123456799, 123456700", "Address: 5 Green Lane", "", "", "------------ Phase 2 ------------",
"DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane", "------------ Phase 3 ------------", "DOB: 01/23/01", "SSN: 123456777, 125556777, 123455555, 123456700", "Address: 5 Green Lane", "", "",
"")
REGEX NOTES:
- (?<=\d{9}) : Positive lookbehind (?<=...). Matches if the position is preceded by nine ({9}) digits (\d). Does not consume characters.
- [\s\",]* : Positive character class [...]. Matches 0 or more (*) whitespace (\s), literal " (\"), or literal , characters. These characters are removed/deleted when the replacement string is applied.
- (\d{9}) : Capture group (...), group 1, referred to in the replacement string with $1. Matches 9 ({9}) digits (\d) and stores the captured characters in group 1.
REPLACEMENT STRING NOTES:
- Replace the matched (and consumed) characters with ", " (a literal comma followed by a literal space), followed by the characters stored in capture group 1 ($1).
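For completeness, one way to apply the same idea from R (my sketch, not part of the answer; it works on the plain text rather than the dput representation, so the quote character is dropped from the character class, and f is assumed to be the lines vector from the question):
# My sketch: collapse the lines, then fold consecutive 9-digit runs onto the
# preceding SSN line with a lookbehind-based gsub().
joined <- paste(f, collapse = "\n")
cat(gsub("(?<=\\d{9})[\\s,]*(\\d{9})", ", \\1", joined, perl = TRUE))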
A comment offers a single-regex alternative:
text <- paste(d, collapse="\n")
regmatches(text, gregexpr("(?:\\G(?!^)\\h*\\R\\h*|\\bSSN:\\h*)\\K\\d{9}\\b", text, perl=TRUE))
– Wiktor Stribiżew