I have a character vector like this:
list <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')
I want the unique elements, where the order of the two characters does not matter, so 'b_a' (the same pair as 'a_b') and 'c_b' (the same pair as 'b_c') should be dropped. I have tried unique(), but it does not remove 'b_a' and 'c_b'. I hope to receive some help with this. Many thanks!
asked Feb 5 at 11:58 by user21390049, edited Feb 5 at 22:39 by ThomasIsCoding
- You mean the order is not important, just the characters? – user2974951 Commented Feb 5 at 12:01
- Yes, just the characters. Apologies for my confusing question title :) @user2974951 – user21390049 Commented Feb 5 at 12:04
4 Answers
You could use strsplit() to split the two characters apart, then sort them in alphabetical order and paste them back together. That will turn "b_a" into "a_b". Then you could get the unique values of the sorted strings.
l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')
ll <- strsplit(l, "_")
ll <- sapply(ll, \(x)paste(sort(x), collapse="_"))
unique(ll)
#> [1] "a_b" "a_c" "a_d" "a_e" "b_c"
Created on 2025-02-05 with reprex v2.1.1
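If this comes up more than once, the split/sort/paste steps above could be wrapped in a small function. This is only a sketch: the name unique_unordered and its sep argument are made up for illustration, and it swaps sapply() for vapply() so that an empty input still returns a character vector.

```r
# Hypothetical helper (name and arguments are assumptions, not from the answer):
# returns the unique unordered pairs found in strings such as "a_b".
unique_unordered <- function(x, sep = "_") {
  # Split each string on the literal separator
  parts <- strsplit(x, sep, fixed = TRUE)
  # Sort the pieces so "b_a" and "a_b" get the same key
  key <- vapply(parts, function(p) paste(sort(p), collapse = sep), character(1))
  unique(key)
}

l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c', 'c_b')
res <- unique_unordered(l)
print(res)
#> [1] "a_b" "a_c" "a_d" "a_e" "b_c"
```

Because the pieces are sorted as whole tokens, this should also work when each side is longer than one character.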
This is overkill for this simple example, but conceptually I would think about this as an undirected graph. We can use strcapture() to create a data frame from your vector l, and use igraph::graph_from_data_frame() to construct the graph:
library(igraph)
g <- strcapture("(.+)_(.+)", l, data.frame(x = character(), y = character())) |>
graph_from_data_frame(directed = FALSE) |>
simplify() # remove duplicate edges
If we plot(g), we'll see an undirected graph with one edge for each unique pair.
We can then extract the edges and paste() them together:
d <- as_data_frame(g, what="edges")
paste0(d$from, "_", d$to)
# [1] "a_b" "a_c" "a_d" "a_e" "b_c"
Another option would be to sort the characters within each string first, and then remove the duplicated entries:
l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')
l[!duplicated(Tmisc::strSort(l))]
#[1] "a_b" "a_c" "a_d" "a_e" "b_c"
Yet another way to do it, using the base R function utf8ToInt() to sort the characters of each string:
l[!duplicated(lapply(l, \(x) sort(utf8ToInt(x))))]
#[1] "a_b" "a_c" "a_d" "a_e" "b_c"
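One thing to keep in mind with the two character-sorting approaches: they compare the sorted characters of the whole string, which is fine for single-letter pairs like the ones in the question. If the pieces could be longer than one character, different pairs may end up with the same sorted characters. A small sketch with made-up data (the vector x is not from the question) comparing a character-level key with a token-level key:

```r
# Made-up input with multi-character pieces (not from the question)
x <- c("ab_c", "a_bc", "c_ab")

# Character-level key: sort all code points of the string.
# All three strings contain the same characters, so they share one key.
char_key <- vapply(
  x,
  function(s) intToUtf8(sort(utf8ToInt(s))),
  character(1),
  USE.NAMES = FALSE
)
by_char <- x[!duplicated(char_key)]

# Token-level key: split on "_" and sort the pieces.
# Only "ab_c" and "c_ab" describe the same pair.
token_key <- vapply(
  strsplit(x, "_", fixed = TRUE),
  function(p) paste(sort(p), collapse = "_"),
  character(1)
)
by_token <- x[!duplicated(token_key)]

print(by_char)
#> [1] "ab_c"
print(by_token)
#> [1] "ab_c" "a_bc"
```

For the data in the question both keys give the same answer, so this only matters if the real data has longer labels.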
Borrowing data from @DaveArmstrong's solution, you can try
- Option 1
with(
read.table(text = l, sep = "_"),
unique(paste(pmin(V1, V2), pmax(V1, V2), sep = "_"))
)
- Option 2
idx <- seq_along(l) < match(l, sub("(\\w+)_(\\w+)", "\\2_\\1", l))
unique(l[replace(idx, is.na(idx), TRUE)])
which gives
[1] "a_b" "a_c" "a_d" "a_e" "b_c"
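The two options agree on the data above, but they are not interchangeable in general: Option 1 always returns the alphabetically ordered form of each pair, while Option 2 keeps whichever orientation appears first in the input. A small sketch with made-up data (l2 is not from the question; colClasses = "character" is added here as a precaution so the columns are always read as strings):

```r
# Made-up input where the reversed form "b_a" comes before "a_b"
l2 <- c("b_a", "a_b", "c_b")

# Option 1: rebuild each pair as min_max, so the result is always sorted
opt1 <- with(
  read.table(text = l2, sep = "_", colClasses = "character"),
  unique(paste(pmin(V1, V2), pmax(V1, V2), sep = "_"))
)

# Option 2: drop an element only if its reversed form occurred earlier,
# so the first-seen orientation is kept
idx <- seq_along(l2) < match(l2, sub("(\\w+)_(\\w+)", "\\2_\\1", l2))
opt2 <- unique(l2[replace(idx, is.na(idx), TRUE)])

print(opt1)
#> [1] "a_b" "b_c"
print(opt2)
#> [1] "b_a" "c_b"
```

Which one is preferable depends on whether the original spelling of each pair needs to be preserved.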