r - Consistent significant digits in summary table - Stack Overflow

I noticed that percentages for the same variable are rounded to different numbers of digits, which can produce odd sums in tbl_summary when the data contain small categories.

Example below:

library(gtsummary)

gtsummary::tbl_cross(data = gtsummary::trial[1:30,], row = trt, col = stage, 
                     percent = "cell")

Created on 2025-01-29 with reprex v2.1.1

In the output table, the row and column percentages clearly do not sum to the value reported in "Total". For example, the Drug A total is 53.4% if I sum all the stage percentages.

In some cases this could lead to sums above 100% (for example, 97.8% would be rounded to 98% while the second category would still report 2.2%).
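The arithmetic behind that example can be checked in base R. This is only a sketch: `mixed_round()` is a hypothetical helper, and its rule (values of 10 or more get no decimals, smaller values get one) is an assumption about how the mixed precision arises, not something taken from the gtsummary source.

```r
# Percentages for two categories, as in the 97.8% / 2.2% example above
pct <- c(97.8, 2.2)

# Hypothetical display rule (an assumption): values >= 10 are shown with
# 0 decimals, smaller values with 1 decimal
mixed_round <- function(x) ifelse(x >= 10, round(x, 0), round(x, 1))

shown_mixed   <- mixed_round(pct)  # 98 and 2.2
shown_uniform <- round(pct, 1)     # 97.8 and 2.2

sum(shown_mixed)    # exceeds 100 because only one value was rounded up
sum(shown_uniform)  # adds up to 100 when both values keep one decimal
```

With the same number of decimals in every cell, the displayed values add up again, which is the behaviour I am after.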

The issue seems to be fixable using digits = 1 but this also modifies digits for integers.

I'm unable to figure out how to fine-tune this (either with function arguments or with themes). The final objective is to have the same number of decimal digits in all cells so that the sums are accurate: for example, 1 digit in this case, or 0 digits when using the full dataset (as the categories are fairly large).

Any suggestions?

Asked Jan 29 at 16:40 by devster; edited Jan 29 at 16:58 by M--.

1 Answer

gtsummary::tbl_cross(data = gtsummary::trial[1:30,], row = trt, col = stage, 
                     percent = "cell", digits = c(0, 1))




Based on your comment, it is not obvious what rule or function should be applied to the percentages (it's not that what you want is unclear, but rather that the logic seems paradoxical). Also, I don't particularly agree with using different precision for the same variable. That said, I have included a very laborious way of getting close to what you're describing. Basically, if the rounded total equals the sum of the individually rounded values, we use the rounded values; otherwise we keep the decimals (although the result is not very consistent, see further down).

library(gtsummary)
library(dplyr)
library(tidyr)

gtbl <- tbl_cross(data = gtsummary::trial[1:30,], row = trt, col = stage, 
                  percent = "cell", digits = c(0, 1))

gtbl$table_body %>% 
  select(label, contains("stat")) %>% 
  separate_wider_regex(contains("stat"), 
                       c(v = ".*?", " \\(", p = ".*", "%\\)"), 
                       names_sep = "_") %>% 
  mutate(across(contains("stat"), ~as.numeric(.x))) %>% 
  mutate(tst_lgl = round(stat_0_p) == 
           rowSums(round(select(., matches("stat_[1-9]+_p"))))) %>% 
  mutate(across(contains("_p"), ~ifelse(tst_lgl, round(.x), .x)), 
         .keep = "unused") %>% 
  pivot_longer(-label, names_sep = "_(?=[^_]+$)", 
                       names_to = c("col", "name")) %>% 
  pivot_wider(id_cols = c(label, col)) %>% 
  mutate(value = ifelse(is.na(v), NA_character_, paste0(v, " (", p, "%)")), 
         .keep = "unused") %>% 
  pivot_wider(id_cols = label, names_from = col) %>% 
  right_join({gtbl$table_body %>% select(!contains("stat"))}, ., 
             by = join_by(label)) -> gtbl$table_body

gtbl

Looking at the percentages row-wise, we comply with the "rule" described above, but looking at the columns, it quickly falls apart. My advice: just use 1 or 2 decimals consistently. But if you must, you are better off manually editing the table body.
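To see why the columns fall apart, here is a minimal base R sketch of the same row-wise rule applied to a plain matrix. The counts are hypothetical (they are not taken from `trial`), and the names `counts`, `keep_integer` and `shown` are made up for this illustration.

```r
# Hypothetical 2 x 4 table of counts (rows: treatment, columns: stage), n = 30
counts <- matrix(c(8, 3, 2, 3,
                   3, 6, 3, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Drug A", "Drug B"), paste0("T", 1:4)))

pct       <- 100 * counts / sum(counts)  # cell percentages
row_total <- rowSums(pct)                # "Total" column, unrounded

# Row-wise rule: show integers only if they still add up to the rounded total
keep_integer <- round(row_total) == rowSums(round(pct))

# Start from one decimal everywhere, then switch the qualifying rows to integers
shown <- round(pct, 1)
shown[keep_integer, ] <- round(pct[keep_integer, , drop = FALSE])

keep_integer
shown
```

In this made-up table Drug A keeps one decimal while Drug B is shown as integers, so every column mixes the two precisions, and the column sums are no longer guaranteed to match their totals.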
