I have a problem very similar to this one. However, I am not interested in sorting. I want to group (perhaps the wrong word) identical names together and sum the values attached to them. I would also like to remove one particular string from the rows. I have prepared an example data frame that stays as close as possible to the post I referred to earlier.

branch <- c("OL", "CA", "PL", "OR", "FL")
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ", 
          "Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
          "Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
          "Harry (1), Potter (32), Harry (2)",
          "Timothy (3), HQ, Sara (44), HQ"
          )
performance <- data.frame(branch, perf)
performance
  branch                                                    perf
1     OL  Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ
2     CA   Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)
3     PL Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)
4     OR                       Harry (1), Potter (32), Harry (2)
5     FL                          Timothy (3), HQ, Sara (44), HQ

In the first row, Mattheu appears twice. I want him to appear once, with the numbers summed up, i.e. Mattheu (34). I would also like the string HQ removed.

This is the expected output for the second column:

[1] "Mattheu (34), Jessica (32), Tom (10)"
[2] "Tobias (68), Kurt (22), Mathias (44)"
[3] "Tom (52), Giti (88), Patel (54)"     
[4] "Harry (3), Potter (32)"              
[5] "Timothy (3), Sara (44)"

How can I get the expected output?

Here's an option using the dplyr and tidyr packages.

library(dplyr)
library(tidyr)

performance %>%
  separate_longer_delim(perf, ", ") %>%
  filter(perf != "HQ") %>%
  separate_wider_regex(perf, 
                      c(name = "[A-Za-z]+", "\\s+\\(", score = "\\d+", "\\)")) %>%
  type.convert(as.is = TRUE) %>%
  summarise(score = sum(score), .by = c(branch, name)) %>%
  summarise(perf = paste0(name, " (", score, ")", collapse = ", "), .by = branch)

# A tibble: 5 × 2
#  branch perf                                
#  <chr>  <chr>                               
#1 OL     Mattheu (34), Jessica (32), Tom (10)
#2 CA     Tobias (68), Kurt (22), Mathias (44)
#3 PL     Tom (52), Giti (88), Patel (54)     
#4 OR     Harry (3), Potter (32)              
#5 FL     Timothy (3), Sara (44)

1. Split the data into separate rows with separate_longer_delim(), using ", " as the delimiter.
2. Remove the "HQ" rows.
3. Separate the name and the number into two columns (name and score). The regex matters here: it has to identify the name and score values correctly.
4. Sum the values for each name within each branch.
5. Combine the rows of each branch again to get back the original format.
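
To see what the pattern in step 3 picks out, the same pieces can be tried in base R. This is only an illustrative sketch: the vector `entries` and the variable names are made up here, and joining the pieces into one anchored pattern is an assumption about how they combine, not the tidyr implementation.

```r
# Hypothetical entries, as they look after splitting a row on ", "
entries <- c("Mattheu (12)", "Jessica (32)", "HQ")

# The same pieces as in the separate_wider_regex() call, joined into one anchored pattern
pattern <- "^([A-Za-z]+)\\s+\\((\\d+)\\)$"

# regexec() returns the full match plus one element per capture group
parts <- regmatches(entries, regexec(pattern, entries))

parts[[1]]  # full match, then name, then score
parts[[3]]  # "HQ" does not match, so nothing is captured
```

Names containing spaces, hyphens or non-ASCII letters would not match `[A-Za-z]+`, so the name part of the pattern would need widening for such data.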

In base R you could do

sum_similar <- function(row){
  matches <- regmatches(row, gregexpr("([A-Za-z]+) \\((\\d+)\\)", row))[[1]]
  df <- data.frame(
    names = gsub(" \\(\\d+\\)", "", matches),  # Extract names
    count = as.numeric(gsub("[^0-9]", "", matches))  # Extract numbers
  )
  result <- aggregate(count ~ names, data = df, FUN = sum) # Aggregate by sum
  paste0(result$names, " (", result$count, ")", collapse = ", ") # output
}    
performance$perf <- sapply(performance$perf, sum_similar)

giving

branch	perf
OL	Jessica (32), Mattheu (34), Tom (10)
CA	Kurt (22), Mathias (44), Tobias (68)
PL	Giti (88), Patel (54), Tom (52)
OR	Harry (3), Potter (32)
FL	Sara (44), Timothy (3)

regmatches() finds every element that consists of some text followed by a number in parentheses and stores them in a vector: "Mattheu (12)" "Jessica (32)" "Mattheu (22)" "Tom (10)". Entries such as "HQ" have no number in parentheses, so they are dropped at this step.
gsub(" \\(\\d+\\)", "", matches) replaces every " (number)" with nothing (""), so that only the names remain: "Mattheu (22)" --> "Mattheu"
as.numeric(gsub("[^0-9]", "", matches)) strips everything except the digits and converts the result to an R number: "Tom (10)" --> 10
Both are stored in a data frame.
aggregate(count ~ names, data = df, FUN = sum) sums the counts of identical names into one cell, i.e. it collapses rows that share a name.
paste0(result$names, " (", result$count, ")", collapse = ", ") pastes the aggregated names and counts back together as one string: paste0(c("name1", "name2"), " (", c(1, 2), ")", collapse = ", ") --> "name1 (1), name2 (2)"
sapply(performance$perf, sum_similar) finally applies this function to every row (each string of the column "perf").

Explanation

> t <- regmatches("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ", gregexpr("([A-Za-z]+) \\((\\d+)\\)", "Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ"))[[1]]
> t
[1] "Mattheu (12)" "Jessica (32)" "Mattheu (22)" "Tom (10)"    
> gsub("[^A-Za-z]", "", t) # replace everything except Text or text with ""
[1] "Mattheu" "Jessica" "Mattheu" "Tom"    
> gsub(" \\(\\d+\\)", "", t)
[1] "Mattheu" "Jessica" "Mattheu" "Tom"    
> as.numeric(gsub("[^0-9]", "", t))
[1] 12 32 22 10
> 
> aggregate(count ~ names, data = data.frame(count = as.numeric(gsub("[^0-9]", "", t)), names = gsub("[^A-Za-z]", "", t)), FUN = sum)
    names count
1 Jessica    32
2 Mattheu    34
3     Tom    10
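
aggregate() returns the names in alphabetical order, whereas the expected output in the question lists them in order of first appearance. If that order matters, one possible variation is sketched below. The helper name `sum_in_order` is made up for this example, and it assumes every row contains at least one "name (number)" entry.

```r
sum_in_order <- function(row) {
  # keep only entries of the form "name (number)"; "HQ" is skipped
  matches <- regmatches(row, gregexpr("[A-Za-z]+ \\(\\d+\\)", row))[[1]]
  who   <- sub(" \\(\\d+\\)$", "", matches)           # names
  count <- as.numeric(gsub("[^0-9]", "", matches))    # numbers
  # factor levels in order of first appearance stop tapply() from sorting alphabetically
  totals <- tapply(count, factor(who, levels = unique(who)), sum)
  paste0(names(totals), " (", totals, ")", collapse = ", ")
}

sum_in_order("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ")
# [1] "Mattheu (34), Jessica (32), Tom (10)"
```

It can be applied to the whole column in the same way, e.g. with sapply(performance$perf, sum_in_order, USE.NAMES = FALSE).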

First, we could strsplit at ', '. Over the resulting list, we sapply a function g() that keeps only the elements containing digits (grep), removes the parentheses (gsub), splits at spaces (strsplit), rbinds the pieces into a data frame, converts the types (type.convert), sums the numbers per name (xtabs), and finally formats the result with sprintf and joins it comma-separated using toString().

> f <- \(x) {
+   s <- strsplit(x, ', ')
+   x <- s[[1]]
+   g <- \(x) {
+     a <- gsub('\\(|\\)', '', x[grep('\\d', x)]) |> 
+       strsplit(' ') |> 
+       do.call(what='rbind.data.frame') |> 
+       setNames(c('u', 'n')) |> 
+       type.convert(as.is=TRUE) |> 
+       xtabs(fo=n ~ u)
+     sprintf('%s (%s)', names(a), a) |> 
+       toString()
+   }
+   sapply(s, g)
+ }
> 
> performance |> 
+   transform(perf=f(perf))
  branch                                 perf
1     OL Jessica (32), Mattheu (34), Tom (10)
2     CA Kurt (22), Mathias (44), Tobias (68)
3     PL      Giti (88), Patel (54), Tom (52)
4     OR               Harry (3), Potter (32)
5     FL               Sara (44), Timothy (3)
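
The summing step in g() relies on xtabs() adding up the left-hand side of the formula for each level of the right-hand side. A minimal sketch of just that step, using a small made-up data frame `d` with the same column names u and n:

```r
# Hypothetical intermediate data: one row per "name number" pair
d <- data.frame(u = c("Tom", "Giti", "Tom"), n = c(30L, 88L, 12L))

# xtabs() sums n within each value of u; names come back in alphabetical order
a <- xtabs(n ~ u, data = d)

# format each name/total pair and join them with ", "
out <- toString(sprintf("%s (%s)", names(a), a))
out
# [1] "Giti (88), Tom (42)"
```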

OP didn't really specify how strings should be sorted, so here's alphabetical sorting.

branch <- c("OL", "CA", "PL", "OR", "FL")
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ", 
          "Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
          "Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
          "Harry (1), Potter (32), Harry (2)",
          "Timothy (3), HQ, Sara (44), HQ"
)
performance <- data.frame(branch, perf)

performance$performance2 <- sapply(
  performance$perf,
  \(x) {
    # split on commas and drop "HQ"; logical indexing (unlike setdiff()) keeps repeated entries
    line <- strsplit(x, ",\\s?", perl = TRUE)[[1]]
    line <- line[line != "HQ"]

    mydf <- as.data.frame(
      matrix(
        # unlist() flattens the list returned by strsplit()
        unlist(
          strsplit(
            # turn "name (number)" into "name number" (\1 = name, \2 = number), then split at the space
            gsub("([a-zA-Z]+)\\s\\((\\d+)\\)","\\1 \\2",line),"\\s")),
        # each entry yields a name and a number, so fill a two-column matrix row by row
        ncol = 2, 
        byrow = TRUE, 
        # dimnames: one row name per entry, column names somebody and mynumb
        dimnames=list(seq_along(line),c("somebody","mynumb"))
      ) 
    )

    # convert mynumb as numeric
    mydf$mynumb <- as.numeric(mydf$mynumb)

    # sum (mynumb) group by somebody 
    myagg<- aggregate(mynumb ~ somebody, data=mydf,FUN=sum)

    paste0(myagg$somebody," (",myagg$mynumb,")",collapse =", ")

  },
  simplify = FALSE, 
  USE.NAMES = FALSE
)

performance[,c("branch","performance2")]
#>   branch                         performance2
#> 1     OL Jessica (32), Mattheu (34), Tom (10)
#> 2     CA Kurt (22), Mathias (44), Tobias (68)
#> 3     PL      Giti (88), Patel (54), Tom (52)
#> 4     OR               Harry (3), Potter (32)
#> 5     FL               Sara (44), Timothy (3)

Created on 2025-02-06 with reprex v2.1.1
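
One detail worth checking when "HQ" is removed: setdiff() follows set semantics and discards duplicates, so a row containing the same entry twice, e.g. "Tom (10), Tom (10)", would lose one of them before the sum is taken. The example data has no such row. A minimal sketch with a made-up vector `line`:

```r
# Hypothetical split row containing the same entry twice
line <- c("Tom (10)", "HQ", "Tom (10)", "Giti (88)")

setdiff(line, "HQ")   # set semantics: the repeated "Tom (10)" is collapsed into one
line[line != "HQ"]    # logical indexing keeps both "Tom (10)" entries
```

With logical indexing, Tom would sum to 20 here, while the setdiff() version would report 10.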

r - Grouping string together and sum it together - Stack Overflow
