
r - How can I find common values in multiple dataframes? - Stack Overflow


I have databases of previous addresses for individuals and I want to identify whether two or more people have lived in the same state, and also who those people are. Right now the ID is just in the name of the database.

A mock data set is below.

Anna <- data.frame(Name = c('124 anne st', '400 rose pl', '45 prince st'), 
                  city= c('san francisco', 'brooklyn', 'minneapolis'),
                  state= c('CA', 'NY', 'MN'))

Yusuf <- data.frame(Name = c('12 fort st', '56 melrose pl', '123 main st'), 
                  city= c('new haven', 'atlanta', 'minneapolis'),
                  state= c('CT', 'GA', 'MN'))

Robin <- data.frame(Name = c('28 greene st', '67 apple pl', '222 bellvue ave'), 
                  city= c('houston', 'new york', 'minneapolis'),
                  state= c('TX', 'NY', 'MN'))

What I would like to identify is that

  1. anna and robin have lived in NY and
  2. anna, robin, and yusuf have lived in MN

However when I use Reduce, I only get back "MN" and I can't identify who lived there.

statesincommon <- 
  Reduce(intersect, list (Anna[, 3], 
                         Yusuf[, 3],
                         Robin[, 3]))

asked Feb 5 at 16:23 by chartreusefrogs
  • How did you get into this situation? Instead of storing the names in the object name, which makes things needlessly complicated, either bind them together into one data.frame, or keep them in a named list. – Axeman, commented Feb 5 at 16:44

5 Answers


Probably the sanest thing to do is to first create a proper data frame that combines your databases. The first approach below only works if Anna, Yusuf, and Robin are the only data.frame objects in the global environment.

# Create a named list of data frames automatically
df_list <- mget(ls())

# Filter only data frames (optional, in case there are other objects)
df_list <- df_list[sapply(df_list, is.data.frame)]

# Apply the function to add a new "Source" column dynamically
df_list <- lapply(names(df_list), function(name) {
  df <- df_list[[name]]  # Extract the data frame
  df$Source <- name      # Add the source column
  df                      # Return the modified data frame
})

# Combine all into one data frame
combined_df <- do.call(rbind, df_list)

# Count occurrences of each state
state_counts <- table(combined_df$state)

# Filter rows where state appears at least twice
filtered_df <- subset(combined_df, state %in% names(state_counts[state_counts >= 2]))
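
A variant of the base R approach above, in case the session holds other objects: build the named list by hand instead of using mget(ls()), and count distinct people per state so that someone with two addresses in one state is only counted once. This is just a sketch. The mock data is reduced to its state column, and people, combined, pairs, n_people, shared and who are names made up here for illustration.

```r
# Mock data from the question, reduced to the state column
Anna  <- data.frame(state = c('CA', 'NY', 'MN'))
Yusuf <- data.frame(state = c('CT', 'GA', 'MN'))
Robin <- data.frame(state = c('TX', 'NY', 'MN'))

# Explicit named list, so other objects in the session are ignored
people <- list(Anna = Anna, Yusuf = Yusuf, Robin = Robin)

# Add each person's name as a Source column, then stack the data frames
combined <- do.call(rbind, Map(function(df, nm) cbind(df, Source = nm),
                               people, names(people)))

# One row per (person, state) pair, so repeated states for one person count once
pairs <- unique(combined[c("Source", "state")])

# Number of distinct people per state
n_people <- table(pairs$state)

# Keep states shared by at least two people, then list who lived there
shared <- pairs[pairs$state %in% names(n_people[n_people >= 2]), ]
who <- split(shared$Source, shared$state)
who
```

With the mock data, who should hold Anna, Yusuf and Robin under MN, and Anna and Robin under NY.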

You could also do it with a function:

common_states <- function(...) {
  states <- setNames(lapply(list(...), \(x) unique(x$state)), as.character(match.call())[-1])
  Filter(length, sapply(unique(unlist(states)), \(s) {
    residents <- names(states)[sapply(states, \(x) s %in% x)]
    if (length(residents) > 1) residents
  }, simplify = FALSE))
}

> common_states(Anna, Yusuf, Robin)
$NY
[1] "Anna"  "Robin"

$MN
[1] "Anna"  "Yusuf" "Robin"

Or with tidyverse:

library(dplyr)

# Group by state and filter those with multiple people
result <- bind_rows(list(Anna = Anna, Yusuf = Yusuf, Robin = Robin), .id = "Person") %>%
  distinct(Person, state) %>%
  group_by(state) %>%
  filter(n() > 1) %>%
  summarise(Individuals = toString(unique(Person)))

> result
# A tibble: 2 × 2
  state Individuals       
  <chr> <chr>             
1 MN    Anna, Yusuf, Robin
2 NY    Anna, Robin   

Create a vector nms of the input data frame names, then use mget to create a named list of those data frames, bind them together into a single data frame, and split the names by state. This gives a list L of names for each state. If only the states with 2 or more names are wanted, use Filter as shown below to get L2. We can also display L as a bipartite graph g using igraph. The same igraph code would work with L2 in place of L to display only the subgraph corresponding to the states with 2 or more names.

library(dplyr)
library(igraph)

nms <- c("Yusuf", "Robin", "Anna")
L <- nms %>%
  mget(.GlobalEnv) %>%
  bind_rows(.id = "name") %>%
  with(split(name, state))
str(L)
## List of 6
##  $ CA: chr "Anna"
##  $ CT: chr "Yusuf"
##  $ GA: chr "Yusuf"
##  $ MN: chr [1:3] "Yusuf" "Robin" "Anna"
##  $ NY: chr [1:2] "Robin" "Anna"
##  $ TX: chr "Robin"

L2 <- Filter(function(x) length(x) > 1, L)
str(L2)
## List of 2
##  $ MN: chr [1:3] "Yusuf" "Robin" "Anna"
##  $ NY: chr [1:2] "Robin" "Anna"

s <- stack(L)
g <- graph_from_data_frame(s, directed = FALSE)
V(g)$type <- V(g)$name %in% s[,2]
V(g)$color <- ifelse(V(g)$type, "lightblue", "lightpink")
plot(g, layout = layout_as_bipartite, vertex.label.cex = 0.7,
  edge.color = "black")

I would try something along these lines:

l = Filter(\(d) is(d, 'data.frame'), mget(ls()))
do.call('rbind', Map(cbind, l, name = names(l))) |> 
  aggregate(name ~ state, x = _, unique)

giving

  state               name
1    CA               Anna
2    CT              Yusuf
3    GA              Yusuf
4    MN Anna, Robin, Yusuf
5    NY        Anna, Robin
6    TX              Robin

if Anna, Yusuf, and Robin are the only data.frames in your session. Otherwise we would need to adjust the first line. We could add a filter if really needed.

Edit:

do.call('rbind', Map(cbind, l, name = names(l))) |> 
  aggregate(name ~ state, x = _, unique) |>
  { \(.) .[grep(' ', .$name), ] }()

  state               name
4    MN Anna, Robin, Yusuf
5    NY        Anna, Robin

if a piped version is o.k. to subset/filter.
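
If the session does contain other data frames, the first line could be adjusted to fetch only the wanted objects by name. A minimal sketch, assuming the mock data from the question reduced to its state column; wanted, other_df, combined, by_state and shared are hypothetical names introduced here. The filter uses lengths() on the list column instead of the grep on a space, which is a swapped-in alternative and not the method used above.

```r
# Mock data from the question, reduced to the state column
Anna  <- data.frame(state = c('CA', 'NY', 'MN'))
Yusuf <- data.frame(state = c('CT', 'GA', 'MN'))
Robin <- data.frame(state = c('TX', 'NY', 'MN'))
other_df <- data.frame(x = 1:3)  # hypothetical unrelated data frame in the session

# Adjusted first line: fetch only the wanted objects by name
wanted <- c('Anna', 'Yusuf', 'Robin')  # hypothetical vector of object names
l <- mget(wanted)

# Same bind step as above, with the person added as a `name` column
combined <- do.call('rbind', Map(cbind, l, name = names(l)))

# simplify = FALSE keeps `name` a list column whatever the group sizes are
by_state <- aggregate(name ~ state, data = combined, FUN = unique,
                      simplify = FALSE)

# Keep only the states with more than one person
shared <- by_state[lengths(by_state$name) > 1, ]
shared
```

Counting the entries per state avoids relying on how the list column is converted to text, at the cost of one extra argument to aggregate.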

Like others have mentioned, I would strongly suggest working in a list format. Then you could simply index the names using sapply:

ll <- list(Anna = Anna, 
           Yusuf = Yusuf, 
           Robin = Robin)

names(ll)[sapply(ll, \(x, y = "MN") any(x$state %in% y))]
# [1] "Anna"  "Yusuf" "Robin"

names(ll)[sapply(ll, \(x, y = "NY") any(x$state %in% y))]
# [1] "Anna"  "Robin"

Similarly, you could convert this to a function:

myFun <- function(llist, state){
  names(llist)[sapply(llist, \(x, y = state) any(x$state %in% y))]
}

myFun(ll, "NY")
# [1] "Anna"  "Robin"
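
To cover every state at once rather than one state per call, the same function could be applied over all distinct states and the result filtered. A sketch under the assumption that the mock data (reduced here to its state column) is stored in the named list; all_states, residents and shared are illustrative names, not part of the answer above.

```r
# Mock data from the question, reduced to the state column, kept in a named list
ll <- list(Anna  = data.frame(state = c('CA', 'NY', 'MN')),
           Yusuf = data.frame(state = c('CT', 'GA', 'MN')),
           Robin = data.frame(state = c('TX', 'NY', 'MN')))

# Same lookup as myFun above: names of the people who lived in `state`
myFun <- function(llist, state) {
  names(llist)[sapply(llist, function(x) any(x$state %in% state))]
}

# Every distinct state across all people
all_states <- sort(unique(unlist(lapply(ll, `[[`, "state"))))

# Residents for each state, as a list named by state
residents <- sapply(all_states, function(s) myFun(ll, s), simplify = FALSE)

# Keep only the states shared by two or more people
shared <- Filter(function(x) length(x) > 1, residents)
shared
```

With the mock data this should leave two entries, MN and NY, listing the same people as the single-state calls above.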

You can try this

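# Cross-tabulate state by person; unary + turns the TRUE/FALSE flags into 1/0
d <- +(table(
    stack(
        lapply(list(
            Anna = Anna,
            Yusuf = Yusuf,
            Robin = Robin
        ), `[[`, 3)
    )
) > 0)
# Keep the states where more than one person appears
d[rowSums(d) > 1, ]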

which shows

      ind
values Anna Yusuf Robin
    MN    1     1     1
    NY    1     0     1