I have a file that sometimes has additional values. Most people have a single SSN, but some people have multiple SSNs and they are separated on multiple lines. What is the regex I need for this? My desired output is an R list of SSNs.
The source file was read in using read_lines() as object f:
dput(f)
c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766",
" 123456777", " 123456788", " 123456799",
" 123456700", "Address: 5 Green Lane", "", "", "------------ Phase 2 ------------",
"DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane",
"")
and appears in the text file as:
------------ Phase 1 ------------
DOB: 12/23/23
SSN: 123456766
123456777
123456788
123456799
123456700
Address: 5 Green Lane
------------ Phase 2 ------------
DOB: 12/33/23
SSN: 223456766
Address: 22 Blue Lane
My current regex is: "SSN:\s+\d{9}(\n\s\d{9})?" and I've tried various regex options like dotall and multiline without success.
Regarding output structure, my preference is for something as simple as possible, ideally a data.frame (I'm open to suggestions on best practices). I realize multiples are awkward in a data.frame, since they would have to extend across multiple columns; otherwise a list could handle multiples, or a single data.frame cell with comma-separated values.
Thanks
5 Answers
Perhaps use "SSN:" for a positive look-behind, and match all following digits and whitespace, which can then be squished and split in the next steps?
library(stringr)
txt <-
"------------ Phase 1 ------------
DOB: 12/23/23
SSN: 123456766
123456777
123456788
123456799
123456700
Address: 5 Green Lane
------------ Phase 2 ------------
DOB: 12/33/23
SSN: 223456766
Address: 22 Blue Lane
"
# in case of lines vector from readLines() / readr::read_lines(),
# first collapse to a single string:
# txt <- paste0(f, collapse = "\n")
# or use readr::read_file()
str_extract_all(txt, "(?<=SSN:)[\\s\\d]+", simplify = TRUE) |>
str_squish() |>
str_split(" ")
#> [[1]]
#> [1] "123456766" "123456777" "123456788" "123456799" "123456700"
#>
#> [[2]]
#> [1] "223456766"
Created on 2025-03-25 with reprex v2.1.1
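If you would rather end up in the data.frame shape the question mentions, here is a minimal sketch building on the result above (my addition, not part of the answer; it assumes txt and the stringr calls already shown):
# My sketch (not from the answer): wrap the extracted SSNs in a data.frame,
# either as a list-column or as a comma-separated cell.
ssns <- str_extract_all(txt, "(?<=SSN:)[\\s\\d]+")[[1]] |>
  str_squish() |>
  str_split(" ")
df <- data.frame(phase = seq_along(ssns))
df$SSN <- ssns                                      # list-column variant
df$SSN_csv <- sapply(ssns, paste, collapse = ", ")  # comma-separated variant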
Another approach uses read.dcf, since the file is essentially in Debian Control File layout (field: value lines with indented continuation lines, records separated by blank lines):
DF <- read.dcf(textConnection(grep("^--", f, value = TRUE, invert = TRUE))) |>
  as.data.frame()
DF$SSN <- strsplit(DF$SSN, "\n")
This gives a data frame where the SSN column is a list of character vectors, one character vector per row.
If you would prefer a single \n-separated string in each SSN component, omit the last line.
If you would prefer a comma-separated string for the SSNs, replace the last line with DF$SSN <- chartr("\n", ",", DF$SSN).
If you would prefer that each row be repeated for each SSN, replace the last line with tidyr::separate_longer_delim(DF, SSN, delim = "\n").
The only regexes used are "^--" in grep, to remove the lines beginning with minus signs, and the "\n" in strsplit. We could get rid of the first with an appropriate use of startsWith and the second by adding the fixed = TRUE argument to strsplit, but these regexes seem simple enough that we leave them as is.
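For reference, a sketch of the "one row per SSN" variant mentioned above (my addition; it assumes f from the question and a tidyr version that provides separate_longer_delim):
# My sketch (not from the answer): the "repeat each row per SSN" variant.
DF <- read.dcf(textConnection(grep("^--", f, value = TRUE, invert = TRUE))) |>
  as.data.frame()
long <- tidyr::separate_longer_delim(DF, SSN, delim = "\n")
long$SSN <- trimws(long$SSN)  # defensive; read.dcf normally trims continuation indentation already
long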
I'll throw my hat into the ring ...
If there's a chance that any of the other fields (not just SSN:) could have more than one value, then this method produces a consistent set of values. There are a few options:
Option 1: simplest, long-format:
library(dplyr)
library(tidyr) # separate_wider_delim, fill
# txt <- c(...)  # here txt is the character vector of lines, i.e. the object f from the question
L <- split(txt, cumsum(grepl("^-", txt)))
names(L) <- sapply(L, `[[`, 1)
out <- lapply(L, function(x) data.frame(z=x[-1])) |>
bind_rows(.id = "phase") |>
separate_wider_delim(z, delim = ":", names = c("x", "y"), too_few = "align_end") |>
fill(x) |>
mutate(y = trimws(y)) |>
filter(nzchar(y) & !is.na(y))
out
# # A tibble: 10 × 3
# phase x y
# <chr> <chr> <chr>
# 1 ------------ Phase 1 ------------ DOB 12/23/23
# 2 ------------ Phase 1 ------------ SSN 123456766
# 3 ------------ Phase 1 ------------ SSN 123456777
# 4 ------------ Phase 1 ------------ SSN 123456788
# 5 ------------ Phase 1 ------------ SSN 123456799
# 6 ------------ Phase 1 ------------ SSN 123456700
# 7 ------------ Phase 1 ------------ Address 5 Green Lane
# 8 ------------ Phase 2 ------------ DOB 12/33/23
# 9 ------------ Phase 2 ------------ SSN 223456766
# 10 ------------ Phase 2 ------------ Address 22 Blue Lane
Option 2: if you want exactly one row per phase/field combination, you can list-column the y column:
out <- summarize(out, .by = c(phase, x), across(everything(), list))
out
# # A tibble: 6 × 3
# phase x y
# <chr> <chr> <list>
# 1 ------------ Phase 1 ------------ DOB <chr [1]>
# 2 ------------ Phase 1 ------------ SSN <chr [5]>
# 3 ------------ Phase 1 ------------ Address <chr [1]>
# 4 ------------ Phase 2 ------------ DOB <chr [1]>
# 5 ------------ Phase 2 ------------ SSN <chr [1]>
# 6 ------------ Phase 2 ------------ Address <chr [1]>
where each value in my (unimpressively-named) y column can be length 1+:
out$y[2]
# [[1]]
# [1] "123456766" "123456777" "123456788" "123456799" "123456700"
Option 3: if the "fields" (DOB, SSN, etc.) are consistent enough and you want a wider format, one can pivot this:
pivot_wider(out, id_cols = phase, names_from = "x", values_from = "y")
# # A tibble: 2 × 4
# phase DOB SSN Address
# <chr> <list> <list> <list>
# 1 ------------ Phase 1 ------------ <chr [1]> <chr [5]> <chr [1]>
# 2 ------------ Phase 2 ------------ <chr [1]> <chr [1]> <chr [1]>
Caveat: options 2 and 3 are predicated on your tools (and your brain) being able to process the list-columns of a nested frame. Sometimes it's an "efficiency" or "tidy-models" kind of thing, other times it's purely personal preference.
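If the list-columns get in the way, a minimal sketch of flattening the option-2 frame back out (my addition, not from the answer):
# My sketch: tidyr::unnest() expands the list-column back to one row per value,
# recovering the long format of option 1.
library(tidyr)
unnest(out, cols = y)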
1. read.fwf
For the given format, you can use read.fwf
library(tidyverse)
read.fwf(textConnection(text), widths = c(8, 25)) %>%
filter(!grepl("^--", V1)) %>% na.omit() %>%
mutate(Phase = cumsum(grepl("DOB:", V1)),
V1 = gsub(":","",na_if(trimws(V1), ""))) %>% fill(V1)
V1 V2 Phase
1 DOB 12/23/23 1
2 SSN 123456766 1
3 SSN 123456777 1
4 SSN 123456788 1
5 SSN 123456799 1
6 SSN 123456700 1
7 Address 5 Green Lane 1
10 DOB 12/33/23 2
11 SSN 223456766 2
12 Address 22 Blue Lane 2
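If you then want one row per phase with the SSNs collected together, a follow-on sketch (my addition; it assumes the result of the pipeline above has been assigned to long, and a dplyr version that supports .by in summarise):
# My sketch (not from the answer): one row per phase, SSNs in a single cell.
long |>
  summarise(V2 = paste(V2, collapse = ", "), .by = c(Phase, V1)) |>
  pivot_wider(names_from = V1, values_from = V2)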
2. gsub
This question was originally about the SSN numbers, not about the whole data structure. Assuming you are only interested in the SSN numbers as a vector, you can use grep to find the rows that start with "SSN:" and "Address:", read the lines in between, and strip "SSN:" and any whitespace (\\s+) using gsub:
lines <- readLines(textConnection(text))
unlist(mapply(function(start, end) {
gsub("\\s+", "", gsub("SSN:", "", lines[start:(end - 1)]))
}, grep("SSN:", lines), grep("Address:", lines)))
[1] "123456766" "123456777" "123456788" "123456799" "123456700" "223456766"
Or, to get a list, remove the unlist():
mapply(function(start, end) {
gsub("\\s+", "", gsub("SSN:", "", lines[start:(end - 1)]))
}, grep("SSN:", lines), grep("Address:", lines))
[[1]]
[1] "123456766" "123456777" "123456788" "123456799"223456766 "123456700"
[[2]]
[1] "223456766"
Test data
text <- "------------ Phase 1 ------------
DOB: 12/23/23
SSN: 123456766
123456777
123456788
123456799
123456700
Address: 5 Green Lane
------------ Phase 2 ------------
DOB: 12/33/23
SSN: 223456766
Address: 22 Blue Lane "
I don't know if this would be helpful. However, this is a way you can squeeze all social security numbers into one element, starting with SSN: and followed by a comma-separated list, using regex.
REGEX PATTERN (Java 8 flavor):
(?<=\d{9})[\s\",]*(\d{9})
REPLACEMENT STRING:
, $1
Regex Demo: https://regex101.com/r/JB79X9/9
TEST STRING (Added Phase 3):
c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766",
" 123456777", " 123456788", " 123456799",
" 123456700", "Address: 5 Green Lane", "", "", "------------ Phase 2 ------------",
"DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane", "------------ Phase 3 ------------", "DOB: 01/23/01", "SSN: 123456777",
" 125556777", " 123455555", " 123456700", "Address: 5 Green Lane", "", "",
"")
RESULT:
c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766, 123456777, 123456788, 123456799, 123456700", "Address: 5 Green Lane", "", "", "------------ Phase 2 ------------",
"DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane", "------------ Phase 3 ------------", "DOB: 01/23/01", "SSN: 123456777, 125556777, 123455555, 123456700", "Address: 5 Green Lane", "", "",
"")
REGEX NOTES:
- (?<=\d{9}) : Positive lookbehind (?<=...). Matches if the position is preceded by nine ({9}) digits (\d). Does not consume characters.
- [\s\",]* : Positive character class [...]. Matches 0 or more (*) whitespace (\s), literal " (\"), or literal , characters. These characters are removed/deleted when the replacement string is applied.
- (\d{9}) : Capture group (...), group 1, referred to in the replacement string with $1. Matches 9 ({9}) digits (\d) and stores the captured characters in group 1.
REPLACEMENT STRING NOTES:
- Replace the matched (and consumed) characters with ", " (a literal comma followed by a literal space), followed by the characters stored in capture group 1 ($1).
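For completeness, one way to apply the same idea from R (my sketch, not part of the answer; it works on the plain text rather than the dput representation, so the quote character is dropped from the character class, and f is assumed to be the lines vector from the question):
# My sketch: collapse the lines, then fold consecutive 9-digit runs onto the
# preceding SSN line with a lookbehind-based gsub().
joined <- paste(f, collapse = "\n")
cat(gsub("(?<=\\d{9})[\\s,]*(\\d{9})", ", \\1", joined, perl = TRUE))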
A comment offers a single-regex alternative:
text <- paste(d, collapse="\n")
regmatches(text, gregexpr("(?:\\G(?!^)\\h*\\R\\h*|\\bSSN:\\h*)\\K\\d{9}\\b", text, perl=TRUE))
– Wiktor Stribiżew