I want to demonstrate multivariate outliers with a PCA biplot, by labeling points with large Malahanobis D^2, but only label those points. The idea is that outliers usually appear on the smallest PCA dimensions.
I'm using a cleaned up version of the palmerpenguins
data. I want to label the 3 points identified as noteworthy
.
library(heplots)
library(dplyr)
library(ggplot2)
library(ggbiplot)
library(factoextra)
data(peng, package="heplots")
# find potential multivariate outliers
DSQ <- heplots::Mahalanobis(peng[, 3:6])
noteworthy <- order(DSQ, decreasing = TRUE)[1:3] |> print()
[1] 283 10 35
ggbiplot
doesn't allow points and labels.
Here is an attempt using factoextra::fviz_pca_biplot
. I create a vector lab
with just those 3 case numbers non-blank. But when I tried using that in the call, there are no labels whatsoever in the plot.
peng.pca <- prcomp (~ bill_length + bill_depth + flipper_length + body_mass,
data=peng, scale. = TRUE)
# create vector of labels, blank except for the noteworthy
lab <- 1:nrow(peng)
lab <- ifelse(lab %in% noteworthy, lab, "")
fviz_pca_biplot(
peng.pca,
axes = 3:4,
habillage = peng$species,
addEllipses = TRUE, ellipse.level = 0.68,
palette = peng.colors("dark"),
arrowsize = 1.5, col.var = "black", labelsize=4,
# label = lab
) +
theme(legend.position = "top")
I want to demonstrate multivariate outliers with a PCA biplot, by labeling points with large Malahanobis D^2, but only label those points. The idea is that outliers usually appear on the smallest PCA dimensions.
I'm using a cleaned up version of the palmerpenguins
data. I want to label the 3 points identified as noteworthy
.
library(heplots)
library(dplyr)
library(ggplot2)
library(ggbiplot)
library(factoextra)
data(peng, package="heplots")
# find potential multivariate outliers
DSQ <- heplots::Mahalanobis(peng[, 3:6])
noteworthy <- order(DSQ, decreasing = TRUE)[1:3] |> print()
[1] 283 10 35
ggbiplot
doesn't allow points and labels.
Here is an attempt using factoextra::fviz_pca_biplot
. I create a vector lab
with just those 3 case numbers non-blank. But when I tried using that in the call, there are no labels whatsoever in the plot.
peng.pca <- prcomp (~ bill_length + bill_depth + flipper_length + body_mass,
data=peng, scale. = TRUE)
# create vector of labels, blank except for the noteworthy
lab <- 1:nrow(peng)
lab <- ifelse(lab %in% noteworthy, lab, "")
fviz_pca_biplot(
peng.pca,
axes = 3:4,
habillage = peng$species,
addEllipses = TRUE, ellipse.level = 0.68,
palette = peng.colors("dark"),
arrowsize = 1.5, col.var = "black", labelsize=4,
# label = lab
) +
theme(legend.position = "top")
Share
Improve this question
asked Feb 7 at 16:44
user101089user101089
3,9921 gold badge31 silver badges60 bronze badges
2 Answers
Reset to default 1You can use a geom_text
layer where you add your label
vector as a column to the global data using e.g. dplyr::bind_cols
:
library(heplots)
library(dplyr, warn = FALSE)
library(ggplot2)
library(factoextra)
fviz_pca_biplot(
peng.pca,
axes = 3:4,
habillage = peng$species,
addEllipses = TRUE,
ellipse.level = 0.68,
label = "none",
# palette = peng.colors("dark"),
arrowsize = 1.5,
col.var = "black",
) +
geom_text(
data = ~ dplyr::bind_cols(., label),
aes(label = label),
vjust = 0,
nudge_y = .05
) +
theme(legend.position = "top")
I'm the maintainer of the ggbiplot
pkg, so I dug in and modified the code for ggbiplot()
to do what I want here. I added a geom.ind
argument that allows geom.ind = c("point", "text")
. For testing purposes, this is still on a new geoms
branch in the package.
#remotes::install_github("friendly/ggbiplot", ref = "geoms")
# adjust variable names to fold at '_'
vn <- rownames(peng.pca$rotation)
vn <- gsub("_", "\n", vn)
rownames(peng.pca$rotation) <- vn
ggbiplot(peng.pca,
choices = 3:4,
groups = peng$species,
ellipse = TRUE, ellipse.alpha = 0.1,
circle = TRUE,
var.factor = 4.5,
geom.ind = c("point", "text"),
point.size = 2,
labels = lab, labels.size = 6,
varname.size = 5,
clip = "off") +
theme_minimal(base_size = 14) +
# theme_penguins("dark") +
theme(legend.direction = 'horizontal', legend.position = 'top')
This gives: