最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

r - How can I visualize which rows were kicked as outliers in my dataframe while looking at specific columns only? - Stack Overf

programmeradmin1浏览0评论

So I have the code all set to filter outliers from dataset using the IQR Method. The code I found from a different post ().

It works perfectly for my purpose but now I want to go back and figure out which rows were kicked but I don't know how to edit the code to do that.

Here is the original code used:

# IQR method functions

# @param x A numeric vector
# @param na.rm Whether to exclude NAs when computing quantiles

is_outlier <- function(x, na.rm = TRUE) {
  qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

  lowerq <- qs[1]
  upperq <- qs[2]
  iqr = upperq - lowerq 

  extreme.threshold.upper = (iqr * 1.5) + upperq
  extreme.threshold.lower = lowerq - (iqr * 1.5)

  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}

# Remove rows with outliers in given columns
# Any row with at least 1 outlier will be removed
# @param df A data.frame
# @param cols Names of the columns of interest, defaults to all columns.

remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df <- df[!is_outlier(df[[col]]),]
  }
  df
}

The issue is if I just run the first function I have columns I don't want to be analyzed for outliers which includes date columns but more importantly, ID columns. The ID columns are what I will use to figure out who was kicked and what information was omitted from downstream analysis.

Example data set (I only care about finding outliers in columns P-C, P-D, and P-E)

ID Date Sex Diagnosis P-A P-B P-C P-D P-E
1 1/2/23 1 1 105 70 200 15 50
2 1/4/18 1 1 40 50 150 15 12
3 1/9/20 1 1 70 20 70 10 12
4 2 150 150 80 6 44
5 7/9/15 2 3 148 60 900 7 56
6 6/10/24 2 115 10 1200 40 46
7 8/10/11 1 5 110 90 15 5 23
8 2 2 120 40 60 12 44
9 9/23/22 1 2 99 30 70 15 35

So I have the code all set to filter outliers from dataset using the IQR Method. The code I found from a different post (https://stackoverflow/a/60814481/29391336).

It works perfectly for my purpose but now I want to go back and figure out which rows were kicked but I don't know how to edit the code to do that.

Here is the original code used:

# IQR method functions

# @param x A numeric vector
# @param na.rm Whether to exclude NAs when computing quantiles

is_outlier <- function(x, na.rm = TRUE) {
  qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

  lowerq <- qs[1]
  upperq <- qs[2]
  iqr = upperq - lowerq 

  extreme.threshold.upper = (iqr * 1.5) + upperq
  extreme.threshold.lower = lowerq - (iqr * 1.5)

  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}

# Remove rows with outliers in given columns
# Any row with at least 1 outlier will be removed
# @param df A data.frame
# @param cols Names of the columns of interest, defaults to all columns.

remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df <- df[!is_outlier(df[[col]]),]
  }
  df
}

The issue is if I just run the first function I have columns I don't want to be analyzed for outliers which includes date columns but more importantly, ID columns. The ID columns are what I will use to figure out who was kicked and what information was omitted from downstream analysis.

Example data set (I only care about finding outliers in columns P-C, P-D, and P-E)

ID Date Sex Diagnosis P-A P-B P-C P-D P-E
1 1/2/23 1 1 105 70 200 15 50
2 1/4/18 1 1 40 50 150 15 12
3 1/9/20 1 1 70 20 70 10 12
4 2 150 150 80 6 44
5 7/9/15 2 3 148 60 900 7 56
6 6/10/24 2 115 10 1200 40 46
7 8/10/11 1 5 110 90 15 5 23
8 2 2 120 40 60 12 44
9 9/23/22 1 2 99 30 70 15 35

So, when running remove_outliers (the second and final function) I would do something like

columns_of_interest <- c("P-C", "P-D", "P-E")
remove_outliers(df, columns_of_interest)

Now I want to know what those outliers were.

After running dput() on my actual dataset here is the output:

structure(list(id = c(67L, 348L, 694L, 344L, 344L), age = c(79, 
76, 89, NA, NA), sex = c(1L, 2L, 2L, NA, NA), years_education = c(16L, 
16L, NA, 18L, NA), diagnosis = c(1L, 1L, 1L, 1L, 1L), diagnosis_2 = c(4L, 
NA, 21L, 22L, 21L), p-c = c(0.691, NA, 0.461, 0.782, 
0.812), p-d = c(94.9, NA, 112.5, 82.2, 78.4)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))
Share Improve this question edited Mar 18 at 20:28 dragon_cake asked Mar 18 at 18:41 dragon_cakedragon_cake 236 bronze badges 1
  • Note—aside from being inefficient—your for has a serious flaw in that it calculates outlier after removing rows, as I detail in my answer below. – jay.sf Commented Mar 19 at 7:22
Add a comment  | 

3 Answers 3

Reset to default 2

You could modify your remove_outliers to print out the IDs of outliers in each column.

remove_outliers <- function(df, cols = names(df), id_col) {
  outliers <- c()
  
  for(col in cols){
    cat("Removing outliers in column: ", col, " \n")
    removed_row_id <- df[[id_col]][is_outlier(df[[col]])]
    cat(id_col, "of rows removed:", removed_row_id, "\n")
    outliers <- append(outliers, removed_row_id)
  }
  
  outliers <- unique(outliers)
  
  df[!df[[id_col]] %in% outliers,]
}

Using the modified function you can get:

> remove_outliers(dat, cols = c("P-C", "P-D", "P-E"), id_col = "ID")

# output
# Removing outliers in column:  P-C  
# ID of rows removed: 5 6 
# Removing outliers in column:  P-D  
# ID of rows removed: 6 
# Removing outliers in column:  P-E  
# ID of rows removed:  
# A tibble: 7 × 9
     ID Date      Sex Diagnosis `P-A` `P-B` `P-C` `P-D` `P-E`
  <int> <chr>   <int>     <int> <int> <int> <int> <int> <int>
1     1 1/2/23      1         1   105    70   200    15    50
2     2 1/4/18      1         1    40    50   150    15    12
3     3 1/9/20      1         1    70    20    70    10    12
4     4 NA          2        NA   150   150    80     6    44
5     7 8/10/11     1         5   110    90    15     5    23
6     8 NA          2         2   120    40    60    12    44
7     9 9/23/22     1         2    99    30    70    15    35

data

dat <- structure(list(ID = 1:9, Date = c("1/2/23", "1/4/18", "1/9/20", 
NA, "7/9/15", "6/10/24", "8/10/11", NA, "9/23/22"), Sex = c(1L, 
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), Diagnosis = c(1L, 1L, 1L, NA, 
3L, NA, 5L, 2L, 2L), `P-A` = c(105L, 40L, 70L, 150L, 148L, 115L, 
110L, 120L, 99L), `P-B` = c(70L, 50L, 20L, 150L, 60L, 10L, 90L, 
40L, 30L), `P-C` = c(200L, 150L, 70L, 80L, 900L, 1200L, 15L, 
60L, 70L), `P-D` = c(15L, 15L, 10L, 6L, 7L, 40L, 5L, 12L, 15L
), `P-E` = c(50L, 12L, 12L, 44L, 56L, 46L, 23L, 44L, 35L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -9L))

Note, that your for loop approach has several issues: (1) it calculates outliers of column 2 of cols after rows have been already removed from df, which is what I would consider a serious bug; and (2) overwriting df many times is very inefficient. Moreover I would never set a nr.rm argument to TRUE by default as currently shown in your is_outlier()!

That said, you could use vapply to apply is_outlier() over selected columns, giving a boolean ("logical") matrix. rowSums less than length(cols) are outlier to be removed.

To trace which rows were removed afterwards, add an attribute in the style of na.omit() (i.e., row numbers with their names).

> remove_outliers <- function(df, cols=names(df), na.rm=FALSE) {
+   otl <- vapply(df[cols], is_outlier, logical(nrow(df)), na.rm=na.rm)
+   rmv <- rowSums(!otl, na.rm=na.rm) != length(cols)
+   out <- df[!rmv, ]
+   if (any(rmv)) {
+     attr(out, 'rm.row') <- setNames(which(rmv), rownames(df)[rmv])
+   }
+   out
+ }
> r <- remove_outliers(df=dat, cols=rm_cols)
> r
  ID    Date Sex Diagnosis P-A P-B P-C P-D P-E
1  1  1/2/23   1         1 105  70 200  15  50
2  2  1/4/18   1         1  40  50 150  15  12
3  3  1/9/20   1         1  70  20  70  10  12
4  4    <NA>   2        NA 150 150  80   6  44
7  7 8/10/11   1         5 110  90  15   5  23
8  8    <NA>   2         2 120  40  60  12  44
9  9 9/23/22   1         2  99  30  70  15  35
> attr(r, 'rm.row')
5 6 
5 6 

To show removed IDs, we could incorporate an id= argument expecting the name of an ID column, which, if its non-missing, adds an attribute of removed IDs. To prevent future annoyance, we can integrate a verbose flag, that shows messages (using cat for messages is bad style) only if set to TRUE.

> remove_outliers <- function(df, cols=names(df), id, na.rm=FALSE, verbose=FALSE) {
+   otl <- vapply(df[cols], is_outlier, logical(nrow(df)), na.rm=na.rm)
+   rmv <- rowSums(!otl, na.rm=na.rm) != length(cols)
+   out <- df[!rmv, ]
+   if (any(rmv)) {
+     if (!missing(id)) {
+       attr(out, 'rm.id') <- df[[id]][rmv]
+       if (verbose) {
+         message(sprintf('removed %ss: %s', id, toString(attr(out, 'rm.id'))))
+       }
+     } 
+     attr(out, 'rm.row') <- setNames(which(rmv), rownames(df)[rmv])
+   }
+   out
+ }
> r1 <- remove_outliers(df=dat, cols=rm_cols)
> attr(r1, 'rm.row')
5 6 
5 6 
> r2 <- remove_outliers(df=dat, cols=rm_cols, id='ID')
> attr(r2, 'rm.id')
[1] 5 6
> r3 <- remove_outliers(df=dat, cols=rm_cols, id='ID', verbose=TRUE)
removed IDs: 5, 6

The big advantage is that this doesn't add clutter to function output.


Data:

> dput(dat)
structure(list(ID = 1:9, Date = c("1/2/23", "1/4/18", "1/9/20", 
NA, "7/9/15", "6/10/24", "8/10/11", NA, "9/23/22"), Sex = c(1L, 
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), Diagnosis = c(1L, 1L, 1L, NA, 
3L, NA, 5L, 2L, 2L), `P-A` = c(105L, 40L, 70L, 150L, 148L, 115L, 
110L, 120L, 99L), `P-B` = c(70L, 50L, 20L, 150L, 60L, 10L, 90L, 
40L, 30L), `P-C` = c(200L, 150L, 70L, 80L, 900L, 1200L, 15L, 
60L, 70L), `P-D` = c(15L, 15L, 10L, 6L, 7L, 40L, 5L, 12L, 15L
), `P-E` = c(50L, 12L, 12L, 44L, 56L, 46L, 23L, 44L, 35L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -9L))

> dput(rm_cols)
c("P-C", "P-D", "P-E")

You can slightly modify is_outlier().

is_outlier2 = \(x, na.rm = TRUE, flag=9999) {
  qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  lowerq = qs[1]
  upperq = qs[2]
  iqr = upperq - lowerq 
  extreme.threshold.upper = (iqr * 1.5) + upperq
  extreme.threshold.lower = lowerq - (iqr * 1.5)
  # adapted:
  i = x > extreme.threshold.upper | x < extreme.threshold.lower
  x[i] = flag
  x
}

Apply

coi = c("P-C", "P-D", "P-E")
dat[coi] = lapply(dat[coi], is_outlier2)

Inspect

> dat
  ID    Date Sex Diagnosis P-A P-B  P-C  P-D P-E
1  1  1/2/23   1         1 105  70  200   15  50
2  2  1/4/18   1         1  40  50  150   15  12
3  3  1/9/20   1         1  70  20   70   10  12
4  4    <NA>   2        NA 150 150   80    6  44
5  5  7/9/15   2         3 148  60 9999    7  56
6  6 6/10/24   2        NA 115  10 9999 9999  46
7  7 8/10/11   1         5 110  90   15    5  23
8  8    <NA>   2         2 120  40   60   12  44
9  9 9/23/22   1         2  99  30   70   15  35
which(dat == 9999, arr.ind = TRUE) |>
  as.data.frame() |>
  transform(col = names(dat)[col])
  row col
1   5 P-C
2   6 P-C
3   6 P-D

and remove afterwards

# dat = 
dat[rowSums(dat == 9999, na.rm = TRUE) < 1, ]
  ID    Date Sex Diagnosis P-A P-B P-C P-D P-E
1  1  1/2/23   1         1 105  70 200  15  50
2  2  1/4/18   1         1  40  50 150  15  12
3  3  1/9/20   1         1  70  20  70  10  12
4  4    <NA>   2        NA 150 150  80   6  44
7  7 8/10/11   1         5 110  90  15   5  23
8  8    <NA>   2         2 120  40  60  12  44
9  9 9/23/22   1         2  99  30  70  15  35

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论