So I have the code all set to filter outliers from dataset using the IQR Method. The code I found from a different post ().
It works perfectly for my purpose but now I want to go back and figure out which rows were kicked but I don't know how to edit the code to do that.
Here is the original code used:
# IQR method functions
# @param x A numeric vector
# @param na.rm Whether to exclude NAs when computing quantiles
is_outlier <- function(x, na.rm = TRUE) {
qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
lowerq <- qs[1]
upperq <- qs[2]
iqr = upperq - lowerq
extreme.threshold.upper = (iqr * 1.5) + upperq
extreme.threshold.lower = lowerq - (iqr * 1.5)
# Return logical vector
x > extreme.threshold.upper | x < extreme.threshold.lower
}
# Remove rows with outliers in given columns
# Any row with at least 1 outlier will be removed
# @param df A data.frame
# @param cols Names of the columns of interest, defaults to all columns.
remove_outliers <- function(df, cols = names(df)) {
for (col in cols) {
cat("Removing outliers in column: ", col, " \n")
df <- df[!is_outlier(df[[col]]),]
}
df
}
The issue is if I just run the first function I have columns I don't want to be analyzed for outliers which includes date columns but more importantly, ID columns. The ID columns are what I will use to figure out who was kicked and what information was omitted from downstream analysis.
Example data set (I only care about finding outliers in columns P-C
, P-D
, and P-E
)
ID | Date | Sex | Diagnosis | P-A | P-B | P-C | P-D | P-E |
---|---|---|---|---|---|---|---|---|
1 | 1/2/23 | 1 | 1 | 105 | 70 | 200 | 15 | 50 |
2 | 1/4/18 | 1 | 1 | 40 | 50 | 150 | 15 | 12 |
3 | 1/9/20 | 1 | 1 | 70 | 20 | 70 | 10 | 12 |
4 | 2 | 150 | 150 | 80 | 6 | 44 | ||
5 | 7/9/15 | 2 | 3 | 148 | 60 | 900 | 7 | 56 |
6 | 6/10/24 | 2 | 115 | 10 | 1200 | 40 | 46 | |
7 | 8/10/11 | 1 | 5 | 110 | 90 | 15 | 5 | 23 |
8 | 2 | 2 | 120 | 40 | 60 | 12 | 44 | |
9 | 9/23/22 | 1 | 2 | 99 | 30 | 70 | 15 | 35 |
So I have the code all set to filter outliers from dataset using the IQR Method. The code I found from a different post (https://stackoverflow/a/60814481/29391336).
It works perfectly for my purpose but now I want to go back and figure out which rows were kicked but I don't know how to edit the code to do that.
Here is the original code used:
# IQR method functions
# @param x A numeric vector
# @param na.rm Whether to exclude NAs when computing quantiles
is_outlier <- function(x, na.rm = TRUE) {
qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
lowerq <- qs[1]
upperq <- qs[2]
iqr = upperq - lowerq
extreme.threshold.upper = (iqr * 1.5) + upperq
extreme.threshold.lower = lowerq - (iqr * 1.5)
# Return logical vector
x > extreme.threshold.upper | x < extreme.threshold.lower
}
# Remove rows with outliers in given columns
# Any row with at least 1 outlier will be removed
# @param df A data.frame
# @param cols Names of the columns of interest, defaults to all columns.
remove_outliers <- function(df, cols = names(df)) {
for (col in cols) {
cat("Removing outliers in column: ", col, " \n")
df <- df[!is_outlier(df[[col]]),]
}
df
}
The issue is if I just run the first function I have columns I don't want to be analyzed for outliers which includes date columns but more importantly, ID columns. The ID columns are what I will use to figure out who was kicked and what information was omitted from downstream analysis.
Example data set (I only care about finding outliers in columns P-C
, P-D
, and P-E
)
ID | Date | Sex | Diagnosis | P-A | P-B | P-C | P-D | P-E |
---|---|---|---|---|---|---|---|---|
1 | 1/2/23 | 1 | 1 | 105 | 70 | 200 | 15 | 50 |
2 | 1/4/18 | 1 | 1 | 40 | 50 | 150 | 15 | 12 |
3 | 1/9/20 | 1 | 1 | 70 | 20 | 70 | 10 | 12 |
4 | 2 | 150 | 150 | 80 | 6 | 44 | ||
5 | 7/9/15 | 2 | 3 | 148 | 60 | 900 | 7 | 56 |
6 | 6/10/24 | 2 | 115 | 10 | 1200 | 40 | 46 | |
7 | 8/10/11 | 1 | 5 | 110 | 90 | 15 | 5 | 23 |
8 | 2 | 2 | 120 | 40 | 60 | 12 | 44 | |
9 | 9/23/22 | 1 | 2 | 99 | 30 | 70 | 15 | 35 |
So, when running remove_outliers
(the second and final function) I would do something like
columns_of_interest <- c("P-C", "P-D", "P-E")
remove_outliers(df, columns_of_interest)
Now I want to know what those outliers were.
After running dput()
on my actual dataset here is the output:
structure(list(id = c(67L, 348L, 694L, 344L, 344L), age = c(79,
76, 89, NA, NA), sex = c(1L, 2L, 2L, NA, NA), years_education = c(16L,
16L, NA, 18L, NA), diagnosis = c(1L, 1L, 1L, 1L, 1L), diagnosis_2 = c(4L,
NA, 21L, 22L, 21L), p-c = c(0.691, NA, 0.461, 0.782,
0.812), p-d = c(94.9, NA, 112.5, 82.2, 78.4)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
Share
Improve this question
edited Mar 18 at 20:28
dragon_cake
asked Mar 18 at 18:41
dragon_cakedragon_cake
236 bronze badges
1
|
3 Answers
Reset to default 2You could modify your remove_outliers
to print out the IDs of outliers in each column.
remove_outliers <- function(df, cols = names(df), id_col) {
outliers <- c()
for(col in cols){
cat("Removing outliers in column: ", col, " \n")
removed_row_id <- df[[id_col]][is_outlier(df[[col]])]
cat(id_col, "of rows removed:", removed_row_id, "\n")
outliers <- append(outliers, removed_row_id)
}
outliers <- unique(outliers)
df[!df[[id_col]] %in% outliers,]
}
Using the modified function you can get:
> remove_outliers(dat, cols = c("P-C", "P-D", "P-E"), id_col = "ID")
# output
# Removing outliers in column: P-C
# ID of rows removed: 5 6
# Removing outliers in column: P-D
# ID of rows removed: 6
# Removing outliers in column: P-E
# ID of rows removed:
# A tibble: 7 × 9
ID Date Sex Diagnosis `P-A` `P-B` `P-C` `P-D` `P-E`
<int> <chr> <int> <int> <int> <int> <int> <int> <int>
1 1 1/2/23 1 1 105 70 200 15 50
2 2 1/4/18 1 1 40 50 150 15 12
3 3 1/9/20 1 1 70 20 70 10 12
4 4 NA 2 NA 150 150 80 6 44
5 7 8/10/11 1 5 110 90 15 5 23
6 8 NA 2 2 120 40 60 12 44
7 9 9/23/22 1 2 99 30 70 15 35
data
dat <- structure(list(ID = 1:9, Date = c("1/2/23", "1/4/18", "1/9/20",
NA, "7/9/15", "6/10/24", "8/10/11", NA, "9/23/22"), Sex = c(1L,
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), Diagnosis = c(1L, 1L, 1L, NA,
3L, NA, 5L, 2L, 2L), `P-A` = c(105L, 40L, 70L, 150L, 148L, 115L,
110L, 120L, 99L), `P-B` = c(70L, 50L, 20L, 150L, 60L, 10L, 90L,
40L, 30L), `P-C` = c(200L, 150L, 70L, 80L, 900L, 1200L, 15L,
60L, 70L), `P-D` = c(15L, 15L, 10L, 6L, 7L, 40L, 5L, 12L, 15L
), `P-E` = c(50L, 12L, 12L, 44L, 56L, 46L, 23L, 44L, 35L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L))
Note, that your for
loop approach has several issues: (1) it calculates outliers of column 2 of cols
after rows have been already removed from df
, which is what I would consider a serious bug; and (2) overwriting df
many times is very inefficient. Moreover I would never set a nr.rm
argument to TRUE
by default as currently shown in your is_outlier()
!
That said, you could use vapply
to apply is_outlier()
over selected columns, giving a boolean ("logical") matrix. rowSums
less than length(cols)
are outlier to be removed.
To trace which rows were removed afterwards, add an attr
ibute in the style of na.omit()
(i.e., row numbers with their names).
> remove_outliers <- function(df, cols=names(df), na.rm=FALSE) {
+ otl <- vapply(df[cols], is_outlier, logical(nrow(df)), na.rm=na.rm)
+ rmv <- rowSums(!otl, na.rm=na.rm) != length(cols)
+ out <- df[!rmv, ]
+ if (any(rmv)) {
+ attr(out, 'rm.row') <- setNames(which(rmv), rownames(df)[rmv])
+ }
+ out
+ }
> r <- remove_outliers(df=dat, cols=rm_cols)
> r
ID Date Sex Diagnosis P-A P-B P-C P-D P-E
1 1 1/2/23 1 1 105 70 200 15 50
2 2 1/4/18 1 1 40 50 150 15 12
3 3 1/9/20 1 1 70 20 70 10 12
4 4 <NA> 2 NA 150 150 80 6 44
7 7 8/10/11 1 5 110 90 15 5 23
8 8 <NA> 2 2 120 40 60 12 44
9 9 9/23/22 1 2 99 30 70 15 35
> attr(r, 'rm.row')
5 6
5 6
To show removed IDs, we could incorporate an id=
argument expecting the name of an ID column, which, if its non-missing, adds an attribute of removed IDs. To prevent future annoyance, we can integrate a verbose
flag, that shows message
s (using cat
for messages is bad style) only if set to TRUE
.
> remove_outliers <- function(df, cols=names(df), id, na.rm=FALSE, verbose=FALSE) {
+ otl <- vapply(df[cols], is_outlier, logical(nrow(df)), na.rm=na.rm)
+ rmv <- rowSums(!otl, na.rm=na.rm) != length(cols)
+ out <- df[!rmv, ]
+ if (any(rmv)) {
+ if (!missing(id)) {
+ attr(out, 'rm.id') <- df[[id]][rmv]
+ if (verbose) {
+ message(sprintf('removed %ss: %s', id, toString(attr(out, 'rm.id'))))
+ }
+ }
+ attr(out, 'rm.row') <- setNames(which(rmv), rownames(df)[rmv])
+ }
+ out
+ }
> r1 <- remove_outliers(df=dat, cols=rm_cols)
> attr(r1, 'rm.row')
5 6
5 6
> r2 <- remove_outliers(df=dat, cols=rm_cols, id='ID')
> attr(r2, 'rm.id')
[1] 5 6
> r3 <- remove_outliers(df=dat, cols=rm_cols, id='ID', verbose=TRUE)
removed IDs: 5, 6
The big advantage is that this doesn't add clutter to function output.
Data:
> dput(dat)
structure(list(ID = 1:9, Date = c("1/2/23", "1/4/18", "1/9/20",
NA, "7/9/15", "6/10/24", "8/10/11", NA, "9/23/22"), Sex = c(1L,
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), Diagnosis = c(1L, 1L, 1L, NA,
3L, NA, 5L, 2L, 2L), `P-A` = c(105L, 40L, 70L, 150L, 148L, 115L,
110L, 120L, 99L), `P-B` = c(70L, 50L, 20L, 150L, 60L, 10L, 90L,
40L, 30L), `P-C` = c(200L, 150L, 70L, 80L, 900L, 1200L, 15L,
60L, 70L), `P-D` = c(15L, 15L, 10L, 6L, 7L, 40L, 5L, 12L, 15L
), `P-E` = c(50L, 12L, 12L, 44L, 56L, 46L, 23L, 44L, 35L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L))
> dput(rm_cols)
c("P-C", "P-D", "P-E")
You can slightly modify is_outlier()
.
is_outlier2 = \(x, na.rm = TRUE, flag=9999) {
qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
lowerq = qs[1]
upperq = qs[2]
iqr = upperq - lowerq
extreme.threshold.upper = (iqr * 1.5) + upperq
extreme.threshold.lower = lowerq - (iqr * 1.5)
# adapted:
i = x > extreme.threshold.upper | x < extreme.threshold.lower
x[i] = flag
x
}
Apply
coi = c("P-C", "P-D", "P-E")
dat[coi] = lapply(dat[coi], is_outlier2)
Inspect
> dat
ID Date Sex Diagnosis P-A P-B P-C P-D P-E
1 1 1/2/23 1 1 105 70 200 15 50
2 2 1/4/18 1 1 40 50 150 15 12
3 3 1/9/20 1 1 70 20 70 10 12
4 4 <NA> 2 NA 150 150 80 6 44
5 5 7/9/15 2 3 148 60 9999 7 56
6 6 6/10/24 2 NA 115 10 9999 9999 46
7 7 8/10/11 1 5 110 90 15 5 23
8 8 <NA> 2 2 120 40 60 12 44
9 9 9/23/22 1 2 99 30 70 15 35
which(dat == 9999, arr.ind = TRUE) |>
as.data.frame() |>
transform(col = names(dat)[col])
row col
1 5 P-C
2 6 P-C
3 6 P-D
and remove afterwards
# dat =
dat[rowSums(dat == 9999, na.rm = TRUE) < 1, ]
ID Date Sex Diagnosis P-A P-B P-C P-D P-E
1 1 1/2/23 1 1 105 70 200 15 50
2 2 1/4/18 1 1 40 50 150 15 12
3 3 1/9/20 1 1 70 20 70 10 12
4 4 <NA> 2 NA 150 150 80 6 44
7 7 8/10/11 1 5 110 90 15 5 23
8 8 <NA> 2 2 120 40 60 12 44
9 9 9/23/22 1 2 99 30 70 15 35
for
has a serious flaw in that it calculates outlier after removing rows, as I detail in my answer below. – jay.sf Commented Mar 19 at 7:22