I have databases of previous addresses for individuals, and I want to identify whether two or more people have lived in the same state and also who those people are. Right now the ID is just in the name of the database.
A mock data set is below.
Anna <- data.frame(Name = c('124 anne st', '400 rose pl', '45 prince st'),
city= c('san francisco', 'brooklyn', 'minneapolis'),
state= c('CA', 'NY', 'MN'))
Yusuf <- data.frame(Name = c('12 fort st', '56 melrose pl', '123 main st'),
city= c('new haven', 'atlanta', 'minneapolis'),
state= c('CT', 'GA', 'MN'))
Robin <- data.frame(Name = c('28 greene st', '67 apple pl', '222 bellvue ave'),
city= c('houston', 'new york', 'minneapolis'),
state= c('TX', 'NY', 'MN'))
What I would like to identify is that
- anna and robin have lived in NY and
- anna, robin, and yusuf have lived in MN
However when I use Reduce, I only get back "MN" and I can't identify who lived there.
statesincommon <-
Reduce(intersect, list (Anna[, 3],
Yusuf[, 3],
Robin[, 3]))
asked Feb 5 at 16:23 by chartreusefrogs
- How did you get into this situation? Instead of storing the names in the object names, which makes things needlessly complicated, either bind them together into one data.frame or keep them in a named list. – Axeman, commented Feb 5 at 16:44
5 Answers
Probably the most sane thing to do is to first create a proper data frame that combines your databases. The first approach only works if Anna, Yusuf, and Robin are the only data.frame objects in the global environment.
# Create a named list of data frames automatically
df_list <- mget(ls())
# Filter only data frames (optional, in case there are other objects)
df_list <- df_list[sapply(df_list, is.data.frame)]
# Apply the function to add a new "Source" column dynamically
df_list <- lapply(names(df_list), function(name) {
df <- df_list[[name]] # Extract the data frame
df$Source <- name # Add the source column
df # Return the modified data frame
})
# Combine all into one data frame
combined_df <- do.call(rbind, df_list)
# Count distinct people per state (two addresses in one state count once)
state_counts <- table(unique(combined_df[c("Source", "state")])$state)
# Keep rows for states shared by at least two people
filtered_df <- subset(combined_df, state %in% names(state_counts[state_counts >= 2]))
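If other data frames may be present in the global environment, one possible alternative to collecting everything with mget(ls()) is to name the inputs explicitly. The sketch below assumes the three mock data frames from the question; the object names people and combined are illustrative and not part of the original answer.

```r
# Mock data from the question
Anna <- data.frame(Name = c('124 anne st', '400 rose pl', '45 prince st'),
                   city = c('san francisco', 'brooklyn', 'minneapolis'),
                   state = c('CA', 'NY', 'MN'))
Yusuf <- data.frame(Name = c('12 fort st', '56 melrose pl', '123 main st'),
                    city = c('new haven', 'atlanta', 'minneapolis'),
                    state = c('CT', 'GA', 'MN'))
Robin <- data.frame(Name = c('28 greene st', '67 apple pl', '222 bellvue ave'),
                    city = c('houston', 'new york', 'minneapolis'),
                    state = c('TX', 'NY', 'MN'))

# Explicit named list: unrelated data frames in the session are ignored
people <- list(Anna = Anna, Yusuf = Yusuf, Robin = Robin)

# Add each list name as a Source column, then stack the rows
combined <- do.call(rbind, Map(function(df, nm) cbind(df, Source = nm),
                               people, names(people)))
rownames(combined) <- NULL  # drop the "Anna.1"-style row names from rbind
```

The resulting combined data frame has one row per address and can be passed to the counting and filtering steps in place of combined_df.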
You could also do it with a function:
common_states <- function(...) {
states <- setNames(lapply(list(...), \(x) unique(x$state)), as.character(match.call())[-1])
Filter(length, sapply(unique(unlist(states)), \(s) {
residents <- names(states)[sapply(states, \(x) s %in% x)]
if (length(residents) > 1) residents
}, simplify = FALSE))
}
> common_states(Anna, Yusuf, Robin)
$NY
[1] "Anna" "Robin"
$MN
[1] "Anna" "Yusuf" "Robin"
Or with tidyverse:
library(dplyr)
# Group by state and filter those with multiple people
result <- bind_rows(list(Anna = Anna, Yusuf = Yusuf, Robin = Robin), .id = "Person") %>%
distinct(Person, state) %>%
group_by(state) %>%
filter(n() > 1) %>%
summarise(Individuals = toString(unique(Person)))
> result
# A tibble: 2 × 2
  state Individuals
  <chr> <chr>
1 MN    Anna, Yusuf, Robin
2 NY    Anna, Robin
Create a vector nms of the input data frame names, then use mget to create a named list of those data frames, bind them together into a single data frame, and split the names by state. This gives a list L of names for each state. If only the states with 2 or more names are wanted, use Filter as shown below to get L2. We can also display L as a bipartite graph g using igraph. The same igraph code would work with L2 in place of L to display just the subgraph corresponding to the states with 2 or more names.
library (dplyr)
library (igraph)
nms <- c("Yusuf", "Robin", "Anna")
L <- nms %>%
mget(.GlobalEnv) %>%
bind_rows(.id = "name") %>%
with(split(name, state))
str(L)
## List of 6
## $ CA: chr "Anna"
## $ CT: chr "Yusuf"
## $ GA: chr "Yusuf"
## $ MN: chr [1:3] "Yusuf" "Robin" "Anna"
## $ NY: chr [1:2] "Robin" "Anna"
## $ TX: chr "Robin"
L2 <- Filter(function(x) length(x) > 1, L)
str(L2)
## List of 2
## $ MN: chr [1:3] "Yusuf" "Robin" "Anna"
## $ NY: chr [1:2] "Robin" "Anna"
s <- stack(L)
g <- graph_from_data_frame(s, directed = FALSE)
V(g)$type <- V(g)$name %in% s[,2]
V(g)$color <- ifelse(V(g)$type, "lightblue", "lightpink")
plot(g, layout = layout_as_bipartite, vertex.label.cex = 0.7,
edge.color = "black")
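One caveat, stated here as an assumption about real data rather than something the mock set exhibits: if a person has two addresses in the same state, split() lists that person twice for that state, and the length filter would then keep a state that only one person lived in. A minimal sketch of de-duplicating before filtering, using a hypothetical long-format table in which Anna has two NY addresses:

```r
# Hypothetical long-format data (not from the question): Anna has two NY addresses
long <- data.frame(
  name  = c("Anna", "Anna", "Anna", "Robin", "Yusuf"),
  state = c("NY",   "NY",   "MN",   "MN",    "CT"),
  stringsAsFactors = FALSE
)

# Names per state, keeping each person at most once per state
L <- lapply(split(long$name, long$state), unique)

# Keep only states shared by two or more distinct people
L2 <- Filter(function(x) length(x) > 1, L)
str(L2)
```

Here only MN remains in L2. Without the unique() step, NY would also be kept even though Anna is the only person who lived there.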
I would try something along these lines:
l = Filter(\(d) is(d, 'data.frame'), mget(ls()))
do.call('rbind', Map(cbind, l, name = names(l))) |>
aggregate(name ~ state, x = _, unique)
giving
state name
1 CA Anna
2 CT Yusuf
3 GA Yusuf
4 MN Anna, Robin, Yusuf
5 NY Anna, Robin
6 TX Robin
if Anna, Yusuf, and Robin are the only data.frames in your session. Otherwise we would need to adjust the first line. We could add a filter if really needed.
Edit:
do.call('rbind', Map(cbind, l, name = names(l))) |>
aggregate(name ~ state, x = _, unique) |>
{ \(.) .[grep(' ', .$name), ] }()
state name
4 MN Anna, Robin, Yusuf
5 NY Anna, Robin
if a piped version is o.k. to subset/filter.
Like others have mentioned, I would strongly suggest working in a list format. Then you could simply index the names using sapply:
ll <- list(Anna = Anna,
Yusuf = Yusuf,
Robin = Robin)
names(ll)[sapply(ll, \(x, y = "MN") any(x$state %in% y))]
# [1] "Anna" "Yusuf" "Robin"
names(ll)[sapply(ll, \(x, y = "NY") any(x$state %in% y))]
# [1] "Anna" "Robin"
Similarly, you could convert this to a function:
myFun <- function(llist, state){
names(llist)[sapply(llist, \(x, y = state) any(x$state %in% y))]
}
myFun(ll, "NY")
# [1] "Anna" "Robin"
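To cover every state at once rather than querying one state at a time, the same lookup could be mapped over all distinct states. This is only a sketch built on the myFun helper above; all_states, residents, and shared are illustrative names.

```r
# Mock data from the question
Anna <- data.frame(Name = c('124 anne st', '400 rose pl', '45 prince st'),
                   city = c('san francisco', 'brooklyn', 'minneapolis'),
                   state = c('CA', 'NY', 'MN'))
Yusuf <- data.frame(Name = c('12 fort st', '56 melrose pl', '123 main st'),
                    city = c('new haven', 'atlanta', 'minneapolis'),
                    state = c('CT', 'GA', 'MN'))
Robin <- data.frame(Name = c('28 greene st', '67 apple pl', '222 bellvue ave'),
                    city = c('houston', 'new york', 'minneapolis'),
                    state = c('TX', 'NY', 'MN'))
ll <- list(Anna = Anna, Yusuf = Yusuf, Robin = Robin)

# Helper from the answer: who has lived in a given state?
myFun <- function(llist, state) {
  names(llist)[sapply(llist, \(x, y = state) any(x$state %in% y))]
}

# Every distinct state across all people
all_states <- unique(unlist(lapply(ll, \(x) x$state), use.names = FALSE))

# Residents per state, as a list named by state
residents <- sapply(all_states, \(s) myFun(ll, s), simplify = FALSE)

# Keep only states shared by two or more people
shared <- Filter(\(p) length(p) > 1, residents)
shared
```

With the mock data, shared holds NY (Anna, Robin) and MN (Anna, Yusuf, Robin).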
You can try this
# Unary + turns the logical table into 0/1 so it prints as shown below
d <- +(table(
stack(
lapply(list(
Anna = Anna,
Yusuf = Yusuf,
Robin = Robin
), `[[`, 3)
)
) > 0)
d[rowSums(d) > 1, ]
which shows
ind
values Anna Yusuf Robin
MN 1 1 1
NY 1 0 1
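The same state-by-person incidence matrix can also answer the pairwise question (how many states does each pair of people share). The following is a sketch rather than part of the original answer: it rebuilds the matrix as 0/1 integers and uses crossprod(); the names tab, d, and shared are illustrative.

```r
# Mock data from the question
Anna <- data.frame(Name = c('124 anne st', '400 rose pl', '45 prince st'),
                   city = c('san francisco', 'brooklyn', 'minneapolis'),
                   state = c('CA', 'NY', 'MN'))
Yusuf <- data.frame(Name = c('12 fort st', '56 melrose pl', '123 main st'),
                    city = c('new haven', 'atlanta', 'minneapolis'),
                    state = c('CT', 'GA', 'MN'))
Robin <- data.frame(Name = c('28 greene st', '67 apple pl', '222 bellvue ave'),
                    city = c('houston', 'new york', 'minneapolis'),
                    state = c('TX', 'NY', 'MN'))
ll <- list(Anna = Anna, Yusuf = Yusuf, Robin = Robin)

# State-by-person counts, then reduced to 0/1 (lived there or not)
tab <- table(stack(lapply(ll, \(x) x$state)))
d <- (unclass(tab) > 0) * 1L

# t(d) %*% d: entry [i, j] is the number of states persons i and j share;
# the diagonal is the number of distinct states each person has lived in
shared <- crossprod(d)
shared
```

Off-diagonal entries greater than zero mark pairs with at least one state in common; for the mock data, Anna and Robin share two states (NY and MN), while each of them shares one state (MN) with Yusuf.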