I am trying to populate a new vector based on values of an original vector.
For example:
# Key
let <- c("a", "b", "c")
num <- c("one", "two", "three")
# Given the following:
v1 <- c("one", "two", "three", "two", "one")
# Create the following using the key above:
v2 <- c("a", "b", "c", "b", "a")
I have used data.table and have had reasonable success, but I'm wondering if there's a strategy I've overlooked. I want to be able to do this on vectors of 1 billion+ elements but am running into memory issues.
# EXAMPLE
# Create a number and letters that correspond to each other:
data_key <- c(1:100)
letter_class <- sample(letters, 100, replace = TRUE)
# Create vector of numbers
v1 <- sample(data_key, 1e8, replace = TRUE)
v2 <- c() # Make a v2 with letter_class that corresponds to number value in v1
# Create data with data_key and letter_class (requires the data.table package)
library(data.table)
key_table <- data.table(
data_key,
letter_class
)
d1 <- data.table(data_key = v1)
# Subset-method
t1 <- Sys.time()
v2_sub <- key_table[d1, , on = "data_key"][["letter_class"]]
Sys.time() - t1
# Time difference of 3.457874 secs
# Merge-Method
t2 <- Sys.time()
v2_merge <- merge(d1,
key_table,
by = "data_key",
all.x = TRUE)[["letter_class"]]
Sys.time() - t2
# Time difference of 7.833402 secs
I have 32GB of RAM.
Asked Mar 11 at 17:46 by gvan; edited Mar 11 at 18:25.
2 Answers
The fastmatch package has a faster version of match. Benchmarking with your sample data, it's a bit better than twice as fast and uses less memory.
library(fastmatch)
bench::mark(
base = letter_class[match(v1, data_key)],
fastmatch = letter_class[fmatch(v1, data_key)]
)
# A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list>
# 1 base 1.98s 1.98s 0.505 1.49GB 0.505 1 1 1.98s <chr> <Rprofmem> <bench_tm [1]>
# 2 fastmatch 823.87ms 823.87ms 1.21 1.12GB 1.21 1 1 823.87ms <chr> <Rprofmem> <bench_tm [1]>
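The benchmark above compares speed and allocation for a single call, but it does not bound peak memory for the 1 billion+ case in the question. One possible approach, offered here as an assumption and not something taken from the answer, is to run the lookup in chunks so that the temporary integer index returned by match never exceeds a fixed size. The helper name lookup_chunked and its chunk_size parameter are hypothetical. Base match is used so the sketch is self-contained; fmatch could presumably be swapped in if fastmatch is installed.

```r
# Hypothetical helper (not from the original answers): look up values chunk
# by chunk so the temporary index from match() holds at most chunk_size elements.
lookup_chunked <- function(x, keys, values, chunk_size = 1e7) {
  n <- length(x)
  # Preallocate the full result once, with the same type as the lookup values
  out <- vector(mode = typeof(values), length = n)
  if (n == 0) return(out)
  for (s in seq(1, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1, n)
    # Only this chunk's index vector is alive at any one time
    out[s:e] <- values[match(x[s:e], keys)]
  }
  out
}

# Small usage example with the same shape as the question's data
set.seed(1)
data_key <- 1:100
letter_class <- sample(letters, 100, replace = TRUE)
v1 <- sample(data_key, 1e5, replace = TRUE)
v2_chunked <- lookup_chunked(v1, data_key, letter_class, chunk_size = 1e4)
```

The result vector still has to fit in memory in full, so this only trims the intermediate allocations. Whether that is enough on a given machine would need to be measured.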
Solved using match()
# Match-Method
t3 <- Sys.time()
v2_match <- letter_class[match(v1, data_key)]
Sys.time() - t3
# Time difference of 0.821255 secs
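To connect this back to the small example at the top of the question, here is a minimal sketch of the same match-based lookup applied to the character key. The intermediate name idx is introduced here only for illustration.

```r
# Key and input from the question's toy example
let <- c("a", "b", "c")
num <- c("one", "two", "three")
v1 <- c("one", "two", "three", "two", "one")

# match() gives, for each element of v1, its position in num;
# indexing let by those positions performs the lookup
idx <- match(v1, num)
v2 <- let[idx]
v2
```

Any value in v1 that is absent from num would come back as NA, since match returns NA for unmatched elements.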
Comments:
data.table is not going to speed up simple vector lookups like this, base R's [ primitive is about as fast as can be. If you have an example of when it produces incorrect results or when something else in R can out-perform it, I think it would be useful (and interesting) to include that in your question. – r2evans, Mar 11 at 18:06
Sys.time({v2match = letter_class[match(v1, data_key)]}), no need for assignment. – Friede, Mar 11 at 18:41
match here: stackoverflow/q/5577727/6851825 – Jon Spring, Mar 11 at 18:53
system.time({..}), since Sys.time takes no arguments. (I do at times dislike many of R's function naming mis-conventions :-) – r2evans, Mar 11 at 22:32
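As the last comment points out, system.time is the function that accepts an expression to time. A minimal sketch of timing the match lookup that way follows. The vector here is shortened to 1e6 elements, purely as an assumption to keep the example quick, and the name timing is introduced for illustration.

```r
# Same setup as the question, with a shorter v1 so the example runs quickly
data_key <- 1:100
letter_class <- sample(letters, 100, replace = TRUE)
v1 <- sample(data_key, 1e6, replace = TRUE)

# system.time() evaluates the expression in the calling environment,
# so v2_match is still available after the call
timing <- system.time({
  v2_match <- letter_class[match(v1, data_key)]
})
timing[["elapsed"]]  # wall-clock seconds
```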