r - Is there a faster way to populate this vector?

I am trying to populate a new vector based on values of an original vector.

For example:

# Key
let <- c("a", "b", "c")
num <- c("one", "two", "three") 

# Given the following: 
v1 <-  c("one", "two", "three", "two", "one")
# Create the following using the key above:
v2 <- c("a", "b", "c", "b", "a")

I have used data.table with reasonable success, but I'm wondering if there's a strategy I've overlooked. I want to be able to do this on vectors of length 1 billion+, but I am running into memory issues.

# EXAMPLE

# Create a number and letters that correspond to each other:
data_key <- c(1:100)
letter_class <- sample(letters, 100, replace = TRUE)

# Create vector of numbers
v1 <- sample(data_key, 1e8, replace = TRUE)
v2 <- c() # Make a v2 with letter_class that corresponds to number value in v1


library(data.table)

# Build the lookup table from data_key and letter_class
key_table <- data.table(
  data_key,
  letter_class
)

d1 <- data.table(data_key = v1)

# Subset-method
t1 <- Sys.time()
v2_sub <- key_table[d1, , on = "data_key"][["letter_class"]]
Sys.time() - t1 
# Time difference of 3.457874 secs

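# Merge-Method
# Note: merge() on data.tables sorts by the "by" column by default (sort = TRUE),
# so this result is ordered by data_key rather than aligned with v1;
# pass sort = FALSE to keep the row order of d1.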
t2 <- Sys.time()
v2_merge <- merge(d1,
            key_table,
            by = "data_key", 
            all.x = TRUE)[["letter_class"]]
Sys.time() - t2
# Time difference of 7.833402 secs

I have 32GB of RAM.

asked Mar 11 at 17:46 by gvan, edited Mar 11 at 18:25

3 Said differently, gvan ... data.table is not going to speed up simple vector lookups like this, base R's [ primitive is about as fast as can be. If you have an example of when it produces incorrect results or when something else in R can out-perform it, I think it would be useful (and interesting) to include that in your question. – r2evans Commented Mar 11 at 18:06
2 Solved my problem using match(), it's quicker, and doesn't require me to create a data.table() – gvan Commented Mar 11 at 18:21
2 Btw, you can do Sys.time({v2match = letter_class[match(v1, data_key)]}), no need for assignment. – Friede Commented Mar 11 at 18:41
2 Not exactly a dupe as far as I can see, but more discussion of match here: stackoverflow/q/5577727/6851825 – Jon Spring Commented Mar 11 at 18:53
2 @Friede, I assume you meant to do system.time({..}), since Sys.time takes no arguments. (I do at times dislike many of R's function naming mis-conventions :-) – r2evans Commented Mar 11 at 22:32
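
As a rough illustration of the point in the comments about base R's `[`, a lookup key can be stored as a named vector and indexed directly by name. This is a minimal sketch on the toy data from the question; the name `key` is introduced here for illustration only, and whether this beats match() on 1e8 elements would need benchmarking.

```r
# Toy key from the question
let <- c("a", "b", "c")
num <- c("one", "two", "three")
v1 <- c("one", "two", "three", "two", "one")

# Named vector: names are the lookup keys, values are the results
# (`key` is a hypothetical name, not from the original post)
key <- setNames(let, num)

# Index by name with `[`, then drop the names to get a plain character vector
v2 <- unname(key[v1])
v2
# [1] "a" "b" "c" "b" "a"
```

Values of v1 that are missing from the key come back as NA, which is the same behaviour as letter_class[match(v1, data_key)].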

2 Answers

The fastmatch package has a faster version of match, fmatch. Benchmarking with your sample data, it is a bit more than twice as fast (about 0.82 s versus 1.98 s) and allocates less memory (1.12 GB versus 1.49 GB).

library(fastmatch)
bench::mark(
  base = letter_class[match(v1, data_key)],
  fastmatch = letter_class[fmatch(v1, data_key)]
)
# A tibble: 2 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time          
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>        
# 1 base          1.98s    1.98s     0.505    1.49GB    0.505     1     1      1.98s <chr>  <Rprofmem> <bench_tm [1]>
# 2 fastmatch  823.87ms 823.87ms     1.21     1.12GB    1.21      1     1   823.87ms <chr>  <Rprofmem> <bench_tm [1]>
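
Before switching, it may be worth confirming on a smaller sample that fmatch() returns the same result as match(). The sketch below assumes fmatch() is used as a drop-in replacement, as in the benchmark above; the `lookup` name and the fallback to base match() when fastmatch is not installed are illustrative additions, not part of the answer.

```r
# Use fastmatch::fmatch if the package is installed, otherwise fall back to match
# (`lookup` is a hypothetical helper name)
lookup <- if (requireNamespace("fastmatch", quietly = TRUE)) {
  fastmatch::fmatch
} else {
  match
}

# Smaller version of the sample data from the question
set.seed(1)
data_key <- 1:100
letter_class <- sample(letters, 100, replace = TRUE)
v1 <- sample(data_key, 1e5, replace = TRUE)

v2_fast <- letter_class[lookup(v1, data_key)]
v2_base <- letter_class[match(v1, data_key)]

# Both approaches should give the same vector
identical(v2_fast, v2_base)
```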

Solved using match()

# Match-Method
t3 <- Sys.time()
v2_match <- letter_class[match(v1, data_key)]
Sys.time() - t3
# Time difference of 0.821255 secs
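
For the memory concern with 1 billion+ elements, one option is to run the same match() lookup in chunks and write into a preallocated result, which avoids building a full-length temporary index vector. This is a sketch under assumptions, not a measured result: `chunk_size` is a hypothetical tuning value, and the data is scaled down to 1e6 so that it runs quickly.

```r
# Scaled-down version of the sample data from the question
set.seed(42)
data_key <- 1:100
letter_class <- sample(letters, 100, replace = TRUE)
v1 <- sample(data_key, 1e6, replace = TRUE)

chunk_size <- 1e5                 # hypothetical; tune to the available memory
v2 <- character(length(v1))       # preallocate the result once

# Look up one chunk of v1 at a time and fill the matching slice of v2
for (start in seq(1, length(v1), by = chunk_size)) {
  idx <- start:min(start + chunk_size - 1, length(v1))
  v2[idx] <- letter_class[match(v1[idx], data_key)]
}

# Same result as the single-call version
identical(v2, letter_class[match(v1, data_key)])
```

Whether this lowers peak memory enough on a given machine would need to be checked, for example with bench::mark as in the other answer.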
