I have last names (left) and first names (right) separated by a comma.
Among the last names, I often (but not always) have duplicates separated by an underscore. How can I remove the duplicates and, for compound last names, replace the underscore with a space?
Also, first names never have duplicates but always begin with an underscore, which I would like to replace with a space. Compound first names are also separated by an underscore, which I would likewise like to replace with a space.
I have thousands of lines and I'm struggling to find a solution.
Thanks for your help.
Input data:
> dat0
name_ko
1 BLA_BLA,_BLIM
2 CLO_CLO,_SPITCH_SPLOTCH
3 BAD_BOY,_GOOD
4 GOOD_BOY,_BAD_GIRL
Desired output:
> dat1
name_ok
1 BLA, BLIM
2 CLO, SPITCH SPLOTCH
3 BAD BOY, GOOD
4 GOOD BOY, BAD GIRL
Data:
name_ko <- c(
"BLA_BLA,_BLIM",
"CLO_CLO,_SPITCH_SPLOTCH",
"BAD_BOY,_GOOD",
"GOOD_BOY,_BAD_GIRL")
dat0 <- data.frame(name_ko)
asked Mar 13 at 22:27 by denis
1 Answer
You can try:
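# The inner gsub() collapses a word that is immediately repeated after an underscore (X_X -> X);
# the outer gsub() then replaces every remaining underscore with a space.
name_ok <- gsub("_", " ", gsub("(\\b\\w+)_(\\1)", "\\1", name_ko))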
"BLA, BLIM"
"CLO, SPITCH SPLOTCH"
"BAD BOY, GOOD"
"GOOD BOY, BAD GIRL"
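To handle triplets and other repeated patterns, as margusl and zephryl suggested in the comments (thank you), split each string at the comma and keep only the unique parts of the last name: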
name_ko <- c(
"BLA_BLA,_BLIM",
"CLO_CLO,_SPITCH_SPLOTCH",
"BAD_BOY,_GOOD",
"GOOD_BOY,_BAD_GIRL",
"BAD_BAD_BAD_BOY_BOY,_GOOD",
"BAD_BOY_BAD_BOY,_GOOD"
)
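name_ok <- sapply(strsplit(name_ko, ","), function(x) {
  # x[1] is the last-name part: split on "_" and drop repeated parts
  last_names <- unique(unlist(strsplit(trimws(x[1]), "_")))
  # x[2] is the first-name part: turn "_" into spaces, then trim the leading space
  first_names <- trimws(gsub("_", " ", x[2]))
  paste(paste(last_names, collapse = " "), first_names, sep = ", ")
})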
"BLA, BLIM"
"CLO, SPITCH SPLOTCH"
"BAD BOY, GOOD"
"GOOD BOY, BAD GIRL"
"BAD BOY, GOOD"
"BAD BOY, GOOD"
"BAD_BAD_BAD_BOY_BOY,_GOOD"? – margusl, commented Mar 13 at 23:00
"BAD_BOY_BAD_BOY,_GOOD"? – zephryl, commented Mar 13 at 23:03