I have a problem very similar to this one. However, I am not interested in sorting. I want to group (perhaps the wrong word) identical strings together and sum the values attached to them. Secondly, I would like to remove a particular string from the rows. I have prepared an example data frame, keeping it as close as possible to the post I referred to earlier.
branch <- c("OL", "CA", "PL", "OR", "FL")
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ",
"Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
"Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
"Harry (1), Potter (32), Harry (2)",
"Timothy (3), HQ, Sara (44), HQ"
)
> performance <- data.frame(branch, perf)
> performance
branch perf
1 OL Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ
2 CA Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)
3 PL Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)
4 OR Harry (1), Potter (32), Harry (2)
5 FL Timothy (3), HQ, Sara (44), HQ
In the first row, I have Mattheu two times. I want to have it once with the numbers summed up. That means, it should be Mattheu (34). Secondly, I would like to have the string HQ removed.
This is the expected output for the second column:
[1] "Mattheu (34), Jessica (32), Tom (10)"
[2] "Tobias (68), Kurt (22), Mathias (44)"
[3] "Tom (52), Giti (88), Patel (54)"
[4] "Harry (3), Potter (32)"
[5] "Timothy (3), Sara (44)"
How can I get the expected output?
Asked Feb 6 at 11:54 by small_lebowski; edited Feb 7 at 18:15.
4 Answers
Here's an option using the dplyr and tidyr libraries.
library(dplyr)
library(tidyr)
performance %>%
separate_longer_delim(perf, ", ") %>%
filter(perf != "HQ") %>%
separate_wider_regex(perf,
c(name = "[A-Za-z]+", "\\s+\\(", score = "\\d+", "\\)")) %>%
type.convert(as.is = TRUE) %>%
summarise(score = sum(score), .by = c(branch, name)) %>%
summarise(perf = paste0(name, " (", score, ")", collapse = ", "), .by = branch)
# A tibble: 5 × 2
# branch perf
# <chr> <chr>
#1 OL Mattheu (34), Jessica (32), Tom (10)
#2 CA Tobias (68), Kurt (22), Mathias (44)
#3 PL Tom (52), Giti (88), Patel (54)
#4 OR Harry (3), Potter (32)
#5 FL Timothy (3), Sara (44)
- Split the data into separate rows using `separate_longer_delim` based on ", ".
- Remove the "HQ" rows.
- Separate the name and number into two different columns (`name` and `score`). The regex used here is important to correctly identify the `name` and `score` values.
- `sum` the values for each `name` and `branch`.
- Combine the rows for each `branch` to get it back in the original format.
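The first three steps can be checked on a single row. This is a minimal sketch, assuming tidyr 1.3.0 or later (for `separate_longer_delim()` and `separate_wider_regex()`); `one_row` and `long` are names made up for illustration.

```r
library(dplyr)
library(tidyr)

# One row of the example data (hypothetical helper object)
one_row <- data.frame(
  branch = "OL",
  perf = "Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ"
)

long <- one_row %>%
  separate_longer_delim(perf, ", ") %>%   # one entry per row
  filter(perf != "HQ") %>%                # drop the "HQ" entries
  separate_wider_regex(
    perf,
    c(name = "[A-Za-z]+", "\\s+\\(", score = "\\d+", "\\)")
  ) %>%
  mutate(score = as.integer(score))       # score is extracted as text

long$name
long$score
```

At this point there is one row per name and score, so the two `summarise()` calls only have to add up the scores and paste the pieces back together.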
In base R you could do
sum_similar <- function(row){
matches <- regmatches(row, gregexpr("([A-Za-z]+) \\((\\d+)\\)", row))[[1]]
df <- data.frame(
names = gsub(" \\(\\d+\\)", "", matches), # Extract names
count = as.numeric(gsub("[^0-9]", "", matches)) # Extract numbers
)
result <- aggregate(count ~ names, data = df, FUN = sum) # Aggregate by sum
paste0(result$names, " (", result$count, ")", collapse = ", ") # output
}
performance$perf <- sapply(performance$perf, sum_similar)
giving
branch | perf |
---|---|
OL | Jessica (32), Mattheu (34), Tom (10) |
CA | Kurt (22), Mathias (44), Tobias (68) |
PL | Giti (88), Patel (54), Tom (52) |
OR | Harry (3), Potter (32) |
FL | Sara (44), Timothy (3) |
- `regmatches` finds all elements which have some text and then some number in brackets () and stores them in a vector: "Mattheu (12)" "Jessica (32)" "Mattheu (22)" "Tom (10)"
- `gsub(" \\(\\d+\\)", "", matches)` replaces all "(number)" with nothing (""), so that only names stay: "Mattheu (22)" -> "Mattheu"
- `as.numeric(gsub("[^0-9]", "", matches))` extracts any number from a string and converts it to an R number: "Tom (10)" --> 10
- Both are stored in a data frame.
- `aggregate(count ~ names, data = df, FUN = sum)` sums up similar names in one count cell. It basically summarises rows with similar names.
- `paste0(result$names, " (", result$count, ")", collapse = ", ")` finally pastes all aggregated names and counts back together as a string: `paste0(c("name1", "name2"), " (", c(1, 2), ")", collapse = ", ")` --> "name1 (1), name2 (2)"
- `sapply(performance$perf, sum_similar)` finally applies this function to all rows (each string of column "perf")
Explanation
> t <- regmatches("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ", gregexpr("([A-Za-z]+) \\((\\d+)\\)", "Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ"))[[1]]
> t
[1] "Mattheu (12)" "Jessica (32)" "Mattheu (22)" "Tom (10)"
> gsub("[^A-Za-z]", "", t) # replace everything except letters with ""
[1] "Mattheu" "Jessica" "Mattheu" "Tom"
> gsub(" \\(\\d+\\)", "", t)
[1] "Mattheu" "Jessica" "Mattheu" "Tom"
> as.numeric(gsub("[^0-9]", "", t))
[1] 12 32 22 10
>
> aggregate(count ~ names, data = data.frame(count = as.numeric(gsub("[^0-9]", "", t)), names = gsub("[^A-Za-z]", "", t)), FUN = sum)
names count
1 Jessica 32
2 Mattheu 34
3 Tom 10
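One thing to note: `aggregate()` returns the names sorted alphabetically, whereas the expected output in the question keeps them in order of first appearance. A minimal sketch of an order-preserving variant, assuming the same "name (number)" pattern and swapping `aggregate()` for base R's `rowsum()` with `reorder = FALSE`; the function name `sum_in_order` is hypothetical.

```r
sum_in_order <- function(row) {
  # Same extraction as above: pick out every "name (number)" entry
  matches <- regmatches(row, gregexpr("[A-Za-z]+ \\(\\d+\\)", row))[[1]]
  names <- sub(" \\(\\d+\\)$", "", matches)
  count <- as.numeric(gsub("[^0-9]", "", matches))
  # rowsum() sums by group; reorder = FALSE keeps first-appearance order
  # (assumes each row contains at least one "name (number)" entry)
  totals <- rowsum(count, names, reorder = FALSE)
  paste0(rownames(totals), " (", totals[, 1], ")", collapse = ", ")
}

res <- sum_in_order("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ")
res
#> [1] "Mattheu (34), Jessica (32), Tom (10)"
```

It can be applied to the whole column in the same way, with `sapply(performance$perf, sum_in_order)`.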
First, we could `strsplit` at `', '`. Over the resulting list, we `sapply` a function `g()` that `gsub`s the parentheses away, `grep`s out those with numbers, `strsplit`s at spaces, `rbind`s and `type.convert`s, `xtabs`, and finally `sprintf`s the desired result, comma-separated using `toString()`.
> f <- \(x) {
+ s <- strsplit(x, ', ')
+ x <- s[[1]]
+ g <- \(x) {
+ a <- gsub('\\(|\\)', '', x[grep('\\d', x)]) |>
+ strsplit(' ') |>
+ do.call(what='rbind.data.frame') |>
+ setNames(c('u', 'n')) |>
+ type.convert(as.is=TRUE) |>
+ xtabs(fo=n ~ u)
+ sprintf('%s (%s)', names(a), a) |>
+ toString()
+ }
+ sapply(s, g)
+ }
>
> performance |>
+ transform(perf=f(perf))
branch perf
1 OL Jessica (32), Mattheu (34), Tom (10)
2 CA Kurt (22), Mathias (44), Tobias (68)
3 PL Giti (88), Patel (54), Tom (52)
4 OR Harry (3), Potter (32)
5 FL Sara (44), Timothy (3)
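The summing itself happens in the `xtabs` step. A small sketch on toy input (the data frame `d` is made up for illustration) shows what that step and the final formatting do in isolation:

```r
# Toy input (hypothetical): name column u, number column n
d <- data.frame(u = c("Tom", "Giti", "Tom"), n = c(30, 88, 12))

a <- xtabs(n ~ u, data = d)  # sums n within each level of u
out <- toString(sprintf("%s (%s)", names(a), as.vector(a)))
out
#> [1] "Giti (88), Tom (42)"
```

Because `xtabs` turns `u` into a factor, the names come out in alphabetical order, which is why the result above is sorted by name.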
OP didn't really specify how strings should be sorted, so here's alphabetical sorting.
branch <- c("OL", "CA", "PL", "OR", "FL")
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ",
"Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
"Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
"Harry (1), Potter (32), Harry (2)",
"Timothy (3), HQ, Sara (44), HQ"
)
performance <- data.frame(branch, perf)
performance$performance2 <- sapply(
performance$perf,
\(x) {
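# split on commas and drop "HQ"; plain indexing is used instead of setdiff(),
# because setdiff() would also collapse repeated entries such as two "Tom (10)"
line <- strsplit(x, ",\\s?", perl = TRUE)[[1]]
line <- line[line != "HQ"]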
mydf <- as.data.frame(
matrix(
# to flatten strsplit()
unlist(
strsplit(
# split "Mytext (mynumber)" into \1 My Text and \2 My Number without parenthesis
gsub("([a-zA-Z]+)\\s\\((\\d+)\\)","\\1 \\2",line),"\\s")),
#My Text , My Number therefore number of column of the matrix by row =2
ncol = 2,
byrow = TRUE,
# dimnames : row name mandatory 1:,length(line), col name: somebody, mynumb
dimnames=list(1:(length(line)),c("somebody","mynumb"))
)
)
# convert mynumb as numeric
mydf$mynumb <- as.numeric(mydf$mynumb)
# sum (mynumb) group by somebody
myagg<- aggregate(mynumb ~ somebody, data=mydf,FUN=sum)
paste0(myagg$somebody," (",myagg$mynumb,")",collapse =", ")
},
simplify = FALSE,
USE.NAMES = FALSE
)
performance[,c("branch","performance2")]
#> branch performance2
#> 1 OL Jessica (32), Mattheu (34), Tom (10)
#> 2 CA Kurt (22), Mathias (44), Tobias (68)
#> 3 PL Giti (88), Patel (54), Tom (52)
#> 4 OR Harry (3), Potter (32)
#> 5 FL Sara (44), Timothy (3)
Created on 2025-02-06 with reprex v2.1.1