string - How to find unique characters in both forward and backward order in R - Stack Overflow

I have a character vector like this:

list <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')

I want the unique strings, treating a reversed pair as a duplicate, so that 'b_a' and 'c_b' are dropped. I have tried unique(), but it does not remove 'b_a' and 'c_b'. Any help would be appreciated. Many thanks!

asked Feb 5 at 11:58 by user21390049; edited Feb 5 at 22:39 by ThomasIsCoding
  • You mean the order is not important, just the characters? – user2974951, Feb 5 at 12:01
  • Yes, just the characters. Apologies for my confusing question title :) @user2974951 – user21390049, Feb 5 at 12:04

4 Answers

You could use strsplit() to split the two characters apart, then sort them in alphabetical order and paste them back together. That will turn "b_a" into "a_b". Then you could get the unique values of the sorted strings.

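l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')

# split each string on "_" into its two parts
ll <- strsplit(l, "_")
# sort the parts of each pair and glue them back together ("b_a" becomes "a_b")
ll <- sapply(ll, \(x) paste(sort(x), collapse = "_"))
unique(ll)
#> [1] "a_b" "a_c" "a_d" "a_e" "b_c"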

Created on 2025-02-05 with reprex v2.1.1
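
If you need this in more than one place, the split-sort-paste idea above could be wrapped in a small function. This is only a sketch: the name unique_unordered() and its sep argument are made up for illustration and are not part of the original answer. It uses vapply() instead of sapply() so the result is always a character vector, even for empty input.

```r
# Hypothetical helper (name and `sep` argument are assumptions for illustration):
# canonicalise each pair by sorting its parts, then drop duplicates.
unique_unordered <- function(x, sep = "_") {
  # fixed = TRUE treats `sep` as a literal string, not a regular expression
  parts <- strsplit(x, sep, fixed = TRUE)
  keys <- vapply(parts, function(p) paste(sort(p), collapse = sep), character(1))
  unique(keys)
}

l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c', 'c_b')
res <- unique_unordered(l)
print(res)
#> [1] "a_b" "a_c" "a_d" "a_e" "b_c"
```

Note that the result contains the sorted form of each pair, so a vector holding only 'b_a' would come back as "a_b".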

This is overkill for this simple example, but conceptually I would think about this as an undirected graph. We can use strcapture() to create a data frame from your vector l, and use igraph::graph_from_data_frame() to construct the graph:

library(igraph)
g <- strcapture("(.+)_(.+)", l, data.frame(x = character(), y = character())) |>
    graph_from_data_frame(directed = FALSE) |>
    simplify() # remove duplicate edges

If we plot(g), we'll see one vertex per letter and a single edge for each unique pair.

We can then extract the edges and paste() them together:

d <- as_data_frame(g, what="edges")
paste0(d$from, "_", d$to)
# [1] "a_b" "a_c" "a_d" "a_e" "b_c"

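Another option would be to sort the characters within each string first, then drop the duplicated entries. Tmisc::strSort() sorts individual characters, so this relies on each part being a single character, as in your example: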

l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c','c_b')

l[!duplicated(Tmisc::strSort(l))]
#[1] "a_b" "a_c" "a_d" "a_e" "b_c"

Yet another way to do it, using base R's utf8ToInt() to sort the code points of each string:

l[!duplicated(lapply(l, \(x) sort(utf8ToInt(x))))]
#[1] "a_b" "a_c" "a_d" "a_e" "b_c"
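
One thing to keep in mind with character-level sorting: it compares the characters of the whole string, not the two parts around "_". That is fine for single-letter parts like the example, but with longer parts different pairs can collapse into one. The sketch below uses made-up input (not from the question) to contrast character-level keys with part-level keys:

```r
# Made-up input with multi-character parts (an assumption for illustration)
x <- c("ab_c", "a_bc", "c_ab")

# Character-level keys: all three strings contain the same characters,
# so they all sort to the same code points and only the first is kept
char_keys <- lapply(x, function(s) sort(utf8ToInt(s)))
kept_char <- x[!duplicated(char_keys)]

# Part-level keys: split on "_", sort the two parts, paste back together
part_keys <- vapply(
  strsplit(x, "_", fixed = TRUE),
  function(p) paste(sort(p), collapse = "_"),
  character(1)
)
kept_part <- x[!duplicated(part_keys)]

print(kept_char)
#> [1] "ab_c"
print(kept_part)
#> [1] "ab_c" "a_bc"
```

Here "a_bc" is a genuinely different pair from "ab_c", and only the part-level version keeps it.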

Borrowing data from @DaveArmstrong's solution, you can try

  • Option 1
with(
    read.table(text = l, sep = "_"),
    unique(paste(pmin(V1, V2), pmax(V1, V2), sep = "_"))
)
  • Option 2
idx <- seq_along(l) < match(l, sub("(\\w+)_(\\w+)", "\\2_\\1", l))
unique(l[replace(idx, is.na(idx), TRUE)])

which gives

[1] "a_b" "a_c" "a_d" "a_e" "b_c"
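
Option 2 is fairly dense, so here is the same logic unpacked step by step. The intermediate names (swapped, first_rev, keep) are only for illustration and do not appear in the original answer. The idea, as far as I can tell, is to keep an element unless its reversed form occurs earlier in the vector.

```r
l <- c('a_b', 'a_c', 'a_d', 'a_e', 'a_b', 'b_a', 'b_c', 'b_c', 'c_b')

# Swap the two parts of every string: "a_b" becomes "b_a"
swapped <- sub("(\\w+)_(\\w+)", "\\2_\\1", l)

# For each element, the position of the first element that is its reverse
# (NA when no reversed form exists in the vector)
first_rev <- match(l, swapped)
print(first_rev)
#> [1]  6 NA NA NA  6  1  9  9  7

# Keep an element if it comes before its first reversed form;
# elements with no reversed form (NA) are always kept
idx <- seq_along(l) < first_rev
keep <- replace(idx, is.na(idx), TRUE)

res <- unique(l[keep])
print(res)
#> [1] "a_b" "a_c" "a_d" "a_e" "b_c"
```

One edge case worth checking before relying on this: a self-pair such as 'a_a' is its own reverse, so the comparison is never TRUE for it and it gets dropped. That does not matter for the data in the question.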