r - Grouping string together and sum it together - Stack Overflow

I have a problem very similar to one from an earlier post. However, I am not interested in sorting. I want to group (perhaps the wrong word) identical strings together and sum the values attached to them. Secondly, I would like to remove a particular string from the rows. I have prepared an example data frame that is as close as possible to the one in the post I referred to earlier.

branch <- c("OL", "CA", "PL", "OR", "FL")
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ", 
          "Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
          "Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
          "Harry (1), Potter (32), Harry (2)",
          "Timothy (3), HQ, Sara (44), HQ"
          )
performance <- data.frame(branch, perf)
performance
  branch                                                    perf
1     OL  Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ
2     CA   Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)
3     PL Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)
4     OR                       Harry (1), Potter (32), Harry (2)
5     FL                          Timothy (3), HQ, Sara (44), HQ

In the first row, Mattheu appears twice. I want him to appear once, with the numbers summed up, that is, Mattheu (34). Secondly, I would like to have the string HQ removed.

This is the expected output for the second column:

[1] "Mattheu (34), Jessica (32), Tom (10)"
[2] "Tobias (68), Kurt (22), Mathias (44)"
[3] "Tom (52), Giti (88), Patel (54)"     
[4] "Harry (3), Potter (32)"              
[5] "Timothy (3), Sara (44)"

How can I get the expected output?

Asked Feb 6 at 11:54 by small_lebowski; edited Feb 7 at 18:15.

4 Answers

Here's an option using the dplyr and tidyr libraries.

library(dplyr)
library(tidyr)

performance %>%
  separate_longer_delim(perf, ", ") %>%
  filter(perf != "HQ") %>%
  separate_wider_regex(perf, 
                      c(name = "[A-Za-z]+", "\\s+\\(", score = "\\d+", "\\)")) %>%
  type.convert(as.is = TRUE) %>%
  summarise(score = sum(score), .by = c(branch, name)) %>%
  summarise(perf = paste0(name, " (", score, ")", collapse = ", "), .by = branch)

# A tibble: 5 × 2
#  branch perf
#  <chr>  <chr>
#1 OL     Mattheu (34), Jessica (32), Tom (10)
#2 CA     Tobias (68), Kurt (22), Mathias (44)
#3 PL     Tom (52), Giti (88), Patel (54)
#4 OR     Harry (3), Potter (32)
#5 FL     Timothy (3), Sara (44)
  1. Split the data into separate rows with separate_longer_delim, using ", " as the delimiter.
  2. Remove the "HQ" rows.
  3. Separate the name and the number into two columns (name and score). The regex matters here: it has to identify the name and score values correctly.
  4. Sum the values for each name within each branch.
  5. Combine the rows of each branch to get back to the original format.
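
To illustrate step 3, here is a minimal base R sketch of what that regex separates. The anchored pattern and the variable names (entry, pattern, parts) are assumptions made for this illustration, not part of the answer above.

```r
# Hypothetical sample entries, as they look after splitting on ", " and dropping "HQ"
entry <- c("Mattheu (12)", "Jessica (32)", "Tom (10)")

# Same building blocks as in the tidyr call: letters, whitespace, "(", digits, ")"
pattern <- "^([A-Za-z]+)\\s+\\((\\d+)\\)$"

# regexec() returns the full match plus one element per capture group
parts <- regmatches(entry, regexec(pattern, entry))

name  <- vapply(parts, `[`, character(1), 2)              # first capture group
score <- as.integer(vapply(parts, `[`, character(1), 3))  # second capture group

data.frame(name, score)
```

A name containing anything other than letters (a space or a hyphen, say) would not match "[A-Za-z]+", so the pattern would need widening for such data.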

In base R you could do

sum_similar <- function(row){
  matches <- regmatches(row, gregexpr("([A-Za-z]+) \\((\\d+)\\)", row))[[1]]
  df <- data.frame(
    names = gsub(" \\(\\d+\\)", "", matches),  # Extract names
    count = as.numeric(gsub("[^0-9]", "", matches))  # Extract numbers
  )
  result <- aggregate(count ~ names, data = df, FUN = sum) # Aggregate by sum
  paste0(result$names, " (", result$count, ")", collapse = ", ") # output
}    
performance$perf <- sapply(performance$perf, sum_similar)

giving

branch  perf
OL      Jessica (32), Mattheu (34), Tom (10)
CA      Kurt (22), Mathias (44), Tobias (68)
PL      Giti (88), Patel (54), Tom (52)
OR      Harry (3), Potter (32)
FL      Sara (44), Timothy (3)

Explanation

  1. regmatches finds all elements that consist of some text followed by a number in parentheses and stores them in a vector: "Mattheu (12)" "Jessica (32)" "Mattheu (22)" "Tom (10)". "HQ" has no number, so it is dropped at this step.
  2. gsub(" \\(\\d+\\)", "", matches) replaces every " (number)" with nothing (""), so that only the names remain: "Mattheu (22)" --> "Mattheu"
  3. as.numeric(gsub("[^0-9]", "", matches)) strips everything except digits from each string and converts the result to a numeric value: "Tom (10)" --> 10
  4. Both vectors are stored in a data frame.
  5. aggregate(count ~ names, data = df, FUN = sum) sums the counts of rows that share the same name, leaving one row per name.
  6. paste0(result$names, " (", result$count, ")", collapse = ", ") pastes the aggregated names and counts back together as one string: paste0(c("name1", "name2"), " (", c(1, 2), ")", collapse = ", ") --> "name1 (1), name2 (2)"
  7. sapply(performance$perf, sum_similar) applies this function to every row (each string of column "perf").

> t <- regmatches("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ", gregexpr("([A-Za-z]+) \\((\\d+)\\)", "Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ"))[[1]]
> t
[1] "Mattheu (12)" "Jessica (32)" "Mattheu (22)" "Tom (10)"    
> gsub("[^A-Za-z]", "", t) # replace everything except letters with ""
[1] "Mattheu" "Jessica" "Mattheu" "Tom"    
> gsub(" \\(\\d+\\)", "", t)
[1] "Mattheu" "Jessica" "Mattheu" "Tom"    
> as.numeric(gsub("[^0-9]", "", t))
[1] 12 32 22 10
> 
> aggregate(count ~ names, data = data.frame(count = as.numeric(gsub("[^0-9]", "", t)), names = gsub("[^A-Za-z]", "", t)), FUN = sum)
    names count
1 Jessica    32
2 Mattheu    34
3     Tom    10 
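
One edge case worth noting: a row with no "name (number)" entries at all (only "HQ", for example) leaves matches empty, and aggregate() is then likely to stop with an error because there are no rows to aggregate. A minimal sketch of a guarded variant follows; the name sum_similar_safe and the choice to return an empty string are assumptions, not part of the answer.

```r
# Hypothetical guarded variant of sum_similar()
sum_similar_safe <- function(row) {
  matches <- regmatches(row, gregexpr("[A-Za-z]+ \\(\\d+\\)", row))[[1]]
  # Assumption: a row without any scored entries becomes an empty string
  if (length(matches) == 0) return("")
  df <- data.frame(
    names = gsub(" \\(\\d+\\)", "", matches),        # keep only the names
    count = as.numeric(gsub("[^0-9]", "", matches))  # keep only the digits
  )
  result <- aggregate(count ~ names, data = df, FUN = sum)
  paste0(result$names, " (", result$count, ")", collapse = ", ")
}

sum_similar_safe("HQ")
sum_similar_safe("Harry (1), Potter (32), Harry (2)")
```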

First, we could strsplit at ', '. Over the resulting list, we sapply a function g() that keeps only the elements containing digits (which drops "HQ"), gsubs the parentheses away, strsplits at spaces, rbinds the pieces into a data frame, type.converts it, sums the numbers per name with xtabs, and finally sprintfs the desired result, comma-separated using toString().

> f <- \(x) {
+   s <- strsplit(x, ', ')
+   g <- \(x) {
+     # keep entries with digits (drops "HQ"), then strip the parentheses
+     a <- gsub('\\(|\\)', '', x[grep('\\d', x)]) |> 
+       strsplit(' ') |> 
+       do.call(what='rbind.data.frame') |> 
+       setNames(c('u', 'n')) |> 
+       type.convert(as.is=TRUE) |> 
+       xtabs(formula=n ~ u)
+     sprintf('%s (%s)', names(a), a) |> 
+       toString()
+   }
+   sapply(s, g)
+ }
> 
> performance |> 
+   transform(perf=f(perf))
  branch                                 perf
1     OL Jessica (32), Mattheu (34), Tom (10)
2     CA Kurt (22), Mathias (44), Tobias (68)
3     PL      Giti (88), Patel (54), Tom (52)
4     OR               Harry (3), Potter (32)
5     FL               Sara (44), Timothy (3)
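
The summing step relies on xtabs(), which adds up n for each level of u. A small sketch with made-up values shows it in isolation; the data frame d and the name out are hypothetical.

```r
# Hypothetical mini data frame: one row per "name number" pair
d <- data.frame(u = c("Tom", "Giti", "Tom"), n = c(30L, 88L, 12L))

# xtabs() sums n within each level of u; levels come out in alphabetical order
a <- xtabs(n ~ u, data = d)

# names(a) holds the group labels, a itself holds the sums
out <- toString(sprintf('%s (%s)', names(a), a))
out
```

This is also why the result above is sorted alphabetically rather than by first appearance.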

OP didn't really specify how strings should be sorted, so here's alphabetical sorting.

branch <- c("OL", "CA", "PL", "OR", "FL")
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ", 
          "Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
          "Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
          "Harry (1), Potter (32), Harry (2)",
          "Timothy (3), HQ, Sara (44), HQ"
)
performance <- data.frame(branch, perf)

performance$performance2 <- sapply(
  performance$perf,
  \(x) {
    # split on commas (with an optional space) and drop "HQ";
    # plain subsetting is used because setdiff() would also drop repeated entries
    parts <- strsplit(x, ",\\s?", perl = TRUE)[[1]]
    line <- parts[parts != "HQ"]

    mydf <- as.data.frame(
      matrix(
        # unlist() flattens the list returned by strsplit()
        unlist(
          strsplit(
            # rewrite "Name (number)" as "Name number", then split on the space
            gsub("([a-zA-Z]+)\\s\\((\\d+)\\)", "\\1 \\2", line), "\\s")),
        # each entry yields a name and a number, hence two columns filled by row
        ncol = 2,
        byrow = TRUE,
        # row names 1..length(line); column names somebody and mynumb
        dimnames = list(seq_along(line), c("somebody", "mynumb"))
      ) 
    )

    # convert mynumb as numeric
    mydf$mynumb <- as.numeric(mydf$mynumb)

    # sum (mynumb) group by somebody 
    myagg<- aggregate(mynumb ~ somebody, data=mydf,FUN=sum)

    paste0(myagg$somebody," (",myagg$mynumb,")",collapse =", ")

  },
  simplify = FALSE, 
  USE.NAMES = FALSE
)
performance[,c("branch","performance2")]
#>   branch                         performance2
#> 1     OL Jessica (32), Mattheu (34), Tom (10)
#> 2     CA Kurt (22), Mathias (44), Tobias (68)
#> 3     PL      Giti (88), Patel (54), Tom (52)
#> 4     OR               Harry (3), Potter (32)
#> 5     FL               Sara (44), Timothy (3)

Created on 2025-02-06 with reprex v2.1.1
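
The expected output in the question lists names in order of first appearance (Mattheu before Jessica), whereas aggregate() sorts the groups alphabetically. If that order matters, one possible sketch uses rowsum() with reorder = FALSE. The function name sum_keep_order is an assumption for illustration, and the sketch assumes every row holds at least one "name (number)" entry.

```r
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ",
          "Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
          "Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
          "Harry (1), Potter (32), Harry (2)",
          "Timothy (3), HQ, Sara (44), HQ")

# Hypothetical helper: sum per name, keeping names in order of first appearance
sum_keep_order <- function(row) {
  # entries without a number in parentheses (such as "HQ") do not match
  matches <- regmatches(row, gregexpr("[A-Za-z]+ \\(\\d+\\)", row))[[1]]
  who    <- sub(" \\(\\d+\\)$", "", matches)
  counts <- as.numeric(gsub("[^0-9]", "", matches))
  # reorder = FALSE returns the groups in the order they were encountered
  totals <- rowsum(counts, group = who, reorder = FALSE)
  paste0(rownames(totals), " (", totals[, 1], ")", collapse = ", ")
}

result <- vapply(perf, sum_keep_order, character(1), USE.NAMES = FALSE)
result
```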
