r - How can I visualize which rows were kicked as outliers in my dataframe while looking at specific columns only?

So I have the code all set to filter outliers from dataset using the IQR Method. The code I found from a different post ().

It works perfectly for my purpose but now I want to go back and figure out which rows were kicked but I don't know how to edit the code to do that.

Here is the original code used:

# IQR method functions

# @param x A numeric vector
# @param na.rm Whether to exclude NAs when computing quantiles

is_outlier <- function(x, na.rm = TRUE) {
  qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

  lowerq <- qs[1]
  upperq <- qs[2]
  iqr = upperq - lowerq 

  extreme.threshold.upper = (iqr * 1.5) + upperq
  extreme.threshold.lower = lowerq - (iqr * 1.5)

  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}

# Remove rows with outliers in given columns
# Any row with at least 1 outlier will be removed
# @param df A data.frame
# @param cols Names of the columns of interest, defaults to all columns.

remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df <- df[!is_outlier(df[[col]]),]
  }
  df
}

The issue is if I just run the first function I have columns I don't want to be analyzed for outliers which includes date columns but more importantly, ID columns. The ID columns are what I will use to figure out who was kicked and what information was omitted from downstream analysis.

Example data set (I only care about finding outliers in columns P-C, P-D, and P-E)

ID	Date	Sex	Diagnosis	P-A	P-B	P-C	P-D	P-E
1	1/2/23	1	1	105	70	200	15	50
2	1/4/18	1	1	40	50	150	15	12
3	1/9/20	1	1	70	20	70	10	12
4		2		150	150	80	6	44
5	7/9/15	2	3	148	60	900	7	56
6	6/10/24	2		115	10	1200	40	46
7	8/10/11	1	5	110	90	15	5	23
8		2	2	120	40	60	12	44
9	9/23/22	1	2	99	30	70	15	35

So I have the code all set to filter outliers from dataset using the IQR Method. The code I found from a different post (https://stackoverflow/a/60814481/29391336).

It works perfectly for my purpose but now I want to go back and figure out which rows were kicked but I don't know how to edit the code to do that.

Here is the original code used:

# IQR method functions

# @param x A numeric vector
# @param na.rm Whether to exclude NAs when computing quantiles

is_outlier <- function(x, na.rm = TRUE) {
  qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

  lowerq <- qs[1]
  upperq <- qs[2]
  iqr = upperq - lowerq 

  extreme.threshold.upper = (iqr * 1.5) + upperq
  extreme.threshold.lower = lowerq - (iqr * 1.5)

  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}

# Remove rows with outliers in given columns
# Any row with at least 1 outlier will be removed
# @param df A data.frame
# @param cols Names of the columns of interest, defaults to all columns.

remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df <- df[!is_outlier(df[[col]]),]
  }
  df
}

Example data set (I only care about finding outliers in columns P-C, P-D, and P-E)

ID	Date	Sex	Diagnosis	P-A	P-B	P-C	P-D	P-E
1	1/2/23	1	1	105	70	200	15	50
2	1/4/18	1	1	40	50	150	15	12
3	1/9/20	1	1	70	20	70	10	12
4		2		150	150	80	6	44
5	7/9/15	2	3	148	60	900	7	56
6	6/10/24	2		115	10	1200	40	46
7	8/10/11	1	5	110	90	15	5	23
8		2	2	120	40	60	12	44
9	9/23/22	1	2	99	30	70	15	35

So, when running remove_outliers (the second and final function) I would do something like

columns_of_interest <- c("P-C", "P-D", "P-E")
remove_outliers(df, columns_of_interest)

Now I want to know what those outliers were.

After running dput() on my actual dataset here is the output:

structure(list(id = c(67L, 348L, 694L, 344L, 344L), age = c(79, 
76, 89, NA, NA), sex = c(1L, 2L, 2L, NA, NA), years_education = c(16L, 
16L, NA, 18L, NA), diagnosis = c(1L, 1L, 1L, 1L, 1L), diagnosis_2 = c(4L, 
NA, 21L, 22L, 21L), p-c = c(0.691, NA, 0.461, 0.782, 
0.812), p-d = c(94.9, NA, 112.5, 82.2, 78.4)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

Share Improve this question edited Mar 18 at 20:28 asked Mar 18 at 18:41 dragon_cake 236 bronze badges

Note—aside from being inefficient—your for has a serious flaw in that it calculates outlier after removing rows, as I detail in my answer below. – jay.sf Commented Mar 19 at 7:22

Add a comment |

3 Answers 3

Sorted by: Reset to default 2

You could modify your remove_outliers to print out the IDs of outliers in each column.

remove_outliers <- function(df, cols = names(df), id_col) {
  outliers <- c()
  
  for(col in cols){
    cat("Removing outliers in column: ", col, " \n")
    removed_row_id <- df[[id_col]][is_outlier(df[[col]])]
    cat(id_col, "of rows removed:", removed_row_id, "\n")
    outliers <- append(outliers, removed_row_id)
  }
  
  outliers <- unique(outliers)
  
  df[!df[[id_col]] %in% outliers,]
}

Using the modified function you can get:

> remove_outliers(dat, cols = c("P-C", "P-D", "P-E"), id_col = "ID")

# output
# Removing outliers in column:  P-C  
# ID of rows removed: 5 6 
# Removing outliers in column:  P-D  
# ID of rows removed: 6 
# Removing outliers in column:  P-E  
# ID of rows removed:  
# A tibble: 7 × 9
     ID Date      Sex Diagnosis `P-A` `P-B` `P-C` `P-D` `P-E`
  <int> <chr>   <int>     <int> <int> <int> <int> <int> <int>
1     1 1/2/23      1         1   105    70   200    15    50
2     2 1/4/18      1         1    40    50   150    15    12
3     3 1/9/20      1         1    70    20    70    10    12
4     4 NA          2        NA   150   150    80     6    44
5     7 8/10/11     1         5   110    90    15     5    23
6     8 NA          2         2   120    40    60    12    44
7     9 9/23/22     1         2    99    30    70    15    35

data

dat <- structure(list(ID = 1:9, Date = c("1/2/23", "1/4/18", "1/9/20", 
NA, "7/9/15", "6/10/24", "8/10/11", NA, "9/23/22"), Sex = c(1L, 
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), Diagnosis = c(1L, 1L, 1L, NA, 
3L, NA, 5L, 2L, 2L), `P-A` = c(105L, 40L, 70L, 150L, 148L, 115L, 
110L, 120L, 99L), `P-B` = c(70L, 50L, 20L, 150L, 60L, 10L, 90L, 
40L, 30L), `P-C` = c(200L, 150L, 70L, 80L, 900L, 1200L, 15L, 
60L, 70L), `P-D` = c(15L, 15L, 10L, 6L, 7L, 40L, 5L, 12L, 15L
), `P-E` = c(50L, 12L, 12L, 44L, 56L, 46L, 23L, 44L, 35L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -9L))

Note, that your for loop approach has several issues: (1) it calculates outliers of column 2 of cols after rows have been already removed from df, which is what I would consider a serious bug; and (2) overwriting df many times is very inefficient. Moreover I would never set a nr.rm argument to TRUE by default as currently shown in your is_outlier()!

That said, you could use vapply to apply is_outlier() over selected columns, giving a boolean ("logical") matrix. rowSums less than length(cols) are outlier to be removed.

To trace which rows were removed afterwards, add an attribute in the style of na.omit() (i.e., row numbers with their names).

> remove_outliers <- function(df, cols=names(df), na.rm=FALSE) {
+   otl <- vapply(df[cols], is_outlier, logical(nrow(df)), na.rm=na.rm)
+   rmv <- rowSums(!otl, na.rm=na.rm) != length(cols)
+   out <- df[!rmv, ]
+   if (any(rmv)) {
+     attr(out, 'rm.row') <- setNames(which(rmv), rownames(df)[rmv])
+   }
+   out
+ }
> r <- remove_outliers(df=dat, cols=rm_cols)
> r
  ID    Date Sex Diagnosis P-A P-B P-C P-D P-E
1  1  1/2/23   1         1 105  70 200  15  50
2  2  1/4/18   1         1  40  50 150  15  12
3  3  1/9/20   1         1  70  20  70  10  12
4  4    <NA>   2        NA 150 150  80   6  44
7  7 8/10/11   1         5 110  90  15   5  23
8  8    <NA>   2         2 120  40  60  12  44
9  9 9/23/22   1         2  99  30  70  15  35
> attr(r, 'rm.row')
5 6 
5 6

To show removed IDs, we could incorporate an id= argument expecting the name of an ID column, which, if its non-missing, adds an attribute of removed IDs. To prevent future annoyance, we can integrate a verbose flag, that shows messages (using cat for messages is bad style) only if set to TRUE.

> remove_outliers <- function(df, cols=names(df), id, na.rm=FALSE, verbose=FALSE) {
+   otl <- vapply(df[cols], is_outlier, logical(nrow(df)), na.rm=na.rm)
+   rmv <- rowSums(!otl, na.rm=na.rm) != length(cols)
+   out <- df[!rmv, ]
+   if (any(rmv)) {
+     if (!missing(id)) {
+       attr(out, 'rm.id') <- df[[id]][rmv]
+       if (verbose) {
+         message(sprintf('removed %ss: %s', id, toString(attr(out, 'rm.id'))))
+       }
+     } 
+     attr(out, 'rm.row') <- setNames(which(rmv), rownames(df)[rmv])
+   }
+   out
+ }
> r1 <- remove_outliers(df=dat, cols=rm_cols)
> attr(r1, 'rm.row')
5 6 
5 6 
> r2 <- remove_outliers(df=dat, cols=rm_cols, id='ID')
> attr(r2, 'rm.id')
[1] 5 6
> r3 <- remove_outliers(df=dat, cols=rm_cols, id='ID', verbose=TRUE)
removed IDs: 5, 6

The big advantage is that this doesn't add clutter to function output.

Data:

> dput(dat)
structure(list(ID = 1:9, Date = c("1/2/23", "1/4/18", "1/9/20", 
NA, "7/9/15", "6/10/24", "8/10/11", NA, "9/23/22"), Sex = c(1L, 
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), Diagnosis = c(1L, 1L, 1L, NA, 
3L, NA, 5L, 2L, 2L), `P-A` = c(105L, 40L, 70L, 150L, 148L, 115L, 
110L, 120L, 99L), `P-B` = c(70L, 50L, 20L, 150L, 60L, 10L, 90L, 
40L, 30L), `P-C` = c(200L, 150L, 70L, 80L, 900L, 1200L, 15L, 
60L, 70L), `P-D` = c(15L, 15L, 10L, 6L, 7L, 40L, 5L, 12L, 15L
), `P-E` = c(50L, 12L, 12L, 44L, 56L, 46L, 23L, 44L, 35L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -9L))

> dput(rm_cols)
c("P-C", "P-D", "P-E")

You can slightly modify is_outlier().

is_outlier2 = \(x, na.rm = TRUE, flag=9999) {
  qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  lowerq = qs[1]
  upperq = qs[2]
  iqr = upperq - lowerq 
  extreme.threshold.upper = (iqr * 1.5) + upperq
  extreme.threshold.lower = lowerq - (iqr * 1.5)
  # adapted:
  i = x > extreme.threshold.upper | x < extreme.threshold.lower
  x[i] = flag
  x
}

Apply

coi = c("P-C", "P-D", "P-E")
dat[coi] = lapply(dat[coi], is_outlier2)

Inspect

> dat
  ID    Date Sex Diagnosis P-A P-B  P-C  P-D P-E
1  1  1/2/23   1         1 105  70  200   15  50
2  2  1/4/18   1         1  40  50  150   15  12
3  3  1/9/20   1         1  70  20   70   10  12
4  4    <NA>   2        NA 150 150   80    6  44
5  5  7/9/15   2         3 148  60 9999    7  56
6  6 6/10/24   2        NA 115  10 9999 9999  46
7  7 8/10/11   1         5 110  90   15    5  23
8  8    <NA>   2         2 120  40   60   12  44
9  9 9/23/22   1         2  99  30   70   15  35

which(dat == 9999, arr.ind = TRUE) |>
  as.data.frame() |>
  transform(col = names(dat)[col])

  row col
1   5 P-C
2   6 P-C
3   6 P-D

and remove afterwards

# dat = 
dat[rowSums(dat == 9999, na.rm = TRUE) < 1, ]

  ID    Date Sex Diagnosis P-A P-B P-C P-D P-E
1  1  1/2/23   1         1 105  70 200  15  50
2  2  1/4/18   1         1  40  50 150  15  12
3  3  1/9/20   1         1  70  20  70  10  12
4  4    <NA>   2        NA 150 150  80   6  44
7  7 8/10/11   1         5 110  90  15   5  23
8  8    <NA>   2         2 120  40  60  12  44
9  9 9/23/22   1         2  99  30  70  15  35

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

r - How can I visualize which rows were kicked as outliers in my dataframe while looking at specific columns only? - Stack Overf

3 Answers 3

data

与本文相关的文章

评论列表(0)