r - How can I find common values in multiple dataframes?

I have databases of previous addresses for individuals, and I want to identify whether two or more people have lived in the same state and who those people are. Right now the ID is only in the name of each database.

A mock data set is below.

Anna <- data.frame(Name = c('124 anne st', '400 rose pl', '45 prince st'), 
                  city= c('san francisco', 'brooklyn', 'minneapolis'),
                  state= c('CA', 'NY', 'MN'))

Yusuf <- data.frame(Name = c('12 fort st', '56 melrose pl', '123 main st'), 
                  city= c('new haven', 'atlanta', 'minneapolis'),
                  state= c('CT', 'GA', 'MN'))

Robin <- data.frame(Name = c('28 greene st', '67 apple pl', '222 bellvue ave'), 
                  city= c('houston', 'new york', 'minneapolis'),
                  state= c('TX', 'NY', 'MN'))

What I would like to identify is that

anna and robin have lived in NY and
anna, robin, and yusuf have lived in MN

However when I use Reduce, I only get back "MN" and I can't identify who lived there.

statesincommon <- 
  Reduce(intersect, list (Anna[, 3], 
                         Yusuf[, 3],
                         Robin[, 3]))

asked Feb 5 at 16:23 by chartreusefrogs

How did you get into this situation? Storing the names in the object names makes things needlessly complicated; instead, either bind the data frames together into one data.frame, or keep them in a named list. – Axeman, commented Feb 5 at 16:44

5 Answers

Probably the sanest thing to do is to first create a proper data frame that combines your databases. The first approach below only works if Anna, Yusuf, and Robin are the only data.frame objects in the global environment.

# Create a named list of data frames automatically
df_list <- mget(ls())

# Filter only data frames (optional, in case there are other objects)
df_list <- df_list[sapply(df_list, is.data.frame)]

# Apply the function to add a new "Source" column dynamically
df_list <- lapply(names(df_list), function(name) {
  df <- df_list[[name]]  # Extract the data frame
  df$Source <- name      # Add the source column
  df                      # Return the modified data frame
})

# Combine all into one data frame
combined_df <- do.call(rbind, df_list)

# Count the distinct people per state (two addresses in one state count once)
state_counts <- table(unique(combined_df[c("Source", "state")])$state)

# Keep rows whose state is shared by at least two people
filtered_df <- subset(combined_df, state %in% names(state_counts[state_counts >= 2]))
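
The filtered data frame still has one row per address, so it does not yet say who shares which state. A minimal sketch of that last step follows, assuming the data frames are kept in an explicitly named list (called `people` here, a name not in the original) instead of being collected with mget(ls()), and with the mock data reduced to its state column:

```r
# Mock data from the question, reduced to the state column
Anna  <- data.frame(state = c("CA", "NY", "MN"))
Yusuf <- data.frame(state = c("CT", "GA", "MN"))
Robin <- data.frame(state = c("TX", "NY", "MN"))

# Assumption: an explicitly named list instead of mget(ls())
people <- list(Anna = Anna, Yusuf = Yusuf, Robin = Robin)

# One row per distinct person/state pair
pairs <- unique(data.frame(
  Source = rep(names(people), vapply(people, nrow, integer(1))),
  state  = unlist(lapply(people, `[[`, "state"), use.names = FALSE)
))

# Names per state, keeping only states shared by two or more people
by_state <- split(pairs$Source, pairs$state)
shared <- by_state[lengths(by_state) > 1]
shared
```

Building the list by hand avoids picking up unrelated objects from the global environment, at the cost of typing each name twice.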

You could also do it with a function:

common_states <- function(...) {
  states <- setNames(lapply(list(...), \(x) unique(x$state)), as.character(match.call())[-1])
  Filter(length, sapply(unique(unlist(states)), \(s) {
    residents <- names(states)[sapply(states, \(x) s %in% x)]
    if (length(residents) > 1) residents
  }, simplify = FALSE))
}

> common_states(Anna, Yusuf, Robin)
$NY
[1] "Anna"  "Robin"

$MN
[1] "Anna"  "Yusuf" "Robin"

Or with tidyverse:

library(dplyr)

# Group by state and filter those with multiple people
result <- bind_rows(list(Anna = Anna, Yusuf = Yusuf, Robin = Robin), .id = "Person") %>%
  distinct(Person, state) %>%
  group_by(state) %>%
  filter(n() > 1) %>%
  summarise(Individuals = toString(unique(Person)))

> result
# A tibble: 2 × 2
  state Individuals       
  <chr> <chr>             
1 MN    Anna, Yusuf, Robin
2 NY    Anna, Robin

Create a vector nms of the input data frame names, use mget to turn it into a named list of those data frames, bind them together into a single data frame, and split the names by state. This gives a list L of names for each state. If only the states with 2 or more names are wanted, use Filter as shown below to get L2. We can also display L as a bipartite graph g using igraph. The same igraph code would work with L2 in place of L to display just the subgraph corresponding to the states with 2 or more names.

library(dplyr)
library(igraph)

nms <- c("Yusuf", "Robin", "Anna")
L <- nms %>%
  mget(.GlobalEnv) %>%
  bind_rows(.id = "name") %>%
  with(split(name, state))
str(L)
## List of 6
##  $ CA: chr "Anna"
##  $ CT: chr "Yusuf"
##  $ GA: chr "Yusuf"
##  $ MN: chr [1:3] "Yusuf" "Robin" "Anna"
##  $ NY: chr [1:2] "Robin" "Anna"
##  $ TX: chr "Robin"

L2 <- Filter(function(x) length(x) > 1, L)
str(L2)
## List of 2
##  $ MN: chr [1:3] "Yusuf" "Robin" "Anna"
##  $ NY: chr [1:2] "Robin" "Anna"

s <- stack(L)  # two columns: values (person) and ind (state)
g <- graph_from_data_frame(s, directed = FALSE)
# Mark the state vertices as one side of the bipartite graph
V(g)$type <- V(g)$name %in% s[, 2]
V(g)$color <- ifelse(V(g)$type, "lightblue", "lightpink")
plot(g, layout = layout_as_bipartite, vertex.label.cex = 0.7,
  edge.color = "black")

I would try something along these lines:

l = Filter(\(d) is(d, 'data.frame'), mget(ls()))
do.call('rbind', Map(cbind, l, name = names(l))) |> 
  aggregate(name ~ state, x = _, unique)

giving

  state               name
1    CA               Anna
2    CT              Yusuf
3    GA              Yusuf
4    MN Anna, Robin, Yusuf
5    NY        Anna, Robin
6    TX              Robin

if Anna, Yusuf, and Robin are the only data.frames in your session. Otherwise we would need to adjust the first line. We could add a filter if really needed.

Edit:

do.call('rbind', Map(cbind, l, name = names(l))) |> 
  aggregate(name ~ state, x = _, unique) |>
  { \(.) .[grep(' ', .$name), ] }()

  state               name
4    MN Anna, Robin, Yusuf
5    NY        Anna, Robin

if a piped version is o.k. to subset/filter.

Like others have mentioned, I would strongly suggest working in a list format. Then you could simply index the names using sapply:

ll <- list(Anna = Anna, 
           Yusuf = Yusuf, 
           Robin = Robin)

names(ll)[sapply(ll, \(x, y = "MN") any(x$state %in% y))]
# [1] "Anna"  "Yusuf" "Robin"

names(ll)[sapply(ll, \(x, y = "NY") any(x$state %in% y))]
# [1] "Anna"  "Robin"

Similarly, you could convert this to a function:

myFun <- function(llist, state){
  names(llist)[sapply(llist, \(x, y = state) any(x$state %in% y))]
}

myFun(ll, "NY")
# [1] "Anna"  "Robin"
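
The function above answers the question for one state at a time. A minimal sketch of running it over every state to get all shared states in one pass follows; the names `all_states`, `residents`, and `shared` are illustrative, and the mock data is reduced to its state column:

```r
# Mock data from the question, reduced to the state column
Anna  <- data.frame(state = c("CA", "NY", "MN"))
Yusuf <- data.frame(state = c("CT", "GA", "MN"))
Robin <- data.frame(state = c("TX", "NY", "MN"))
ll <- list(Anna = Anna, Yusuf = Yusuf, Robin = Robin)

# Same lookup as myFun, repeated so the sketch runs on its own
myFun <- function(llist, state) {
  names(llist)[sapply(llist, \(x) any(x$state %in% state))]
}

# Every state that appears in any of the data frames
all_states <- sort(unique(unlist(lapply(ll, `[[`, "state"))))

# Look up the residents of each state, then keep states with 2+ people
residents <- setNames(lapply(all_states, \(s) myFun(ll, s)), all_states)
shared <- Filter(\(x) length(x) > 1, residents)
shared
```

This rescans every data frame once per state, which is fine for a handful of people but may be slow for large inputs.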

You can try this

d <- table(
    stack(
        lapply(list(
            Anna = Anna,
            Yusuf = Yusuf,
            Robin = Robin
        ), `[[`, 3)
    )
) > 0
d[rowSums(d) > 1, ]

which shows

      ind
values Anna Yusuf Robin
    MN TRUE  TRUE  TRUE
    NY TRUE FALSE  TRUE
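
If a list of names per state is easier to work with than the matrix, the rows can be converted. A minimal sketch follows, assuming the same state-by-person table; it uses unclass() to get a plain matrix, the names `m`, `keep`, and `shared` are illustrative, and the mock data is reduced to its state column:

```r
# Mock data from the question, reduced to the state column
Anna  <- data.frame(state = c("CA", "NY", "MN"))
Yusuf <- data.frame(state = c("CT", "GA", "MN"))
Robin <- data.frame(state = c("TX", "NY", "MN"))
states <- lapply(list(Anna = Anna, Yusuf = Yusuf, Robin = Robin), `[[`, "state")

# State-by-person presence matrix (TRUE where the person lived in the state)
m <- unclass(table(stack(states))) > 0

# Rows for states shared by two or more people
keep <- m[rowSums(m) > 1, , drop = FALSE]

# For each remaining state, the column names flagged TRUE are its residents
shared <- lapply(setNames(nm = rownames(keep)), \(s) colnames(keep)[keep[s, ]])
shared
```

The drop = FALSE guard keeps `keep` a matrix even when only one state is shared.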
