Is there a smart way to vectorize a nested for loop of inner products, where the inner index is lower bound by the outer index?
Here's a simple example. Say that arr1 and arr2 are numpy arrays, each containing N vectors of length 3, i.e. they are of shape (N,3). N is usually in the range between 500 and 2000. I want to compute all possible combinations of inner products, but I also know that the mathematical problem is designed in such a way that the inner product of the i-th vector in arr1 and the j-th vector in arr2 is equal to the inner product of the j-th vector in arr1 and the i-th vector in arr2.
In an explicit nested for-loop this would look something like this:
import numpy as np

N = 2000
# For simplicity, I choose arr1 and arr2 to be the same here, but this need not always be the case.
arr1 = np.arange(N*3).reshape((N, 3))
arr2 = arr1.copy()

inner_products = np.zeros((arr1.shape[0], arr2.shape[0]))
for idx1 in range(arr1.shape[0]):
    vec1 = arr1[idx1]
    for idx2 in range(idx1, arr2.shape[0]):
        vec2 = arr2[idx2]
        inner_products[idx1, idx2] = np.dot(vec1, vec2)
inner_products = inner_products + inner_products.T
inner_products[np.diag_indices(arr1.shape[0])] /= 2
Is there a smart way to vectorize this in a single command, without carrying out twice as many calculations as needed?
So far, I vectorized the inner loop with:
inner_products = np.zeros((arr1.shape[0], arr2.shape[0]))
for idx1 in range(arr1.shape[0]):
    vec1 = arr1[idx1]
    inner_products[idx1, idx1:] = np.dot(vec1.reshape(1, 3), arr2[idx1:].T)
inner_products = inner_products + inner_products.T
inner_products[np.diag_indices(arr1.shape[0])] /= 2
But this variant is still slower than computing all N*N inner products with a single np.dot command, despite that command computing twice as many inner products:
inner_products = np.dot(arr1, arr2.T)
Numba compilation also does not change the outcome; the single np.dot command remains the fastest.
Is there a smart way to vectorize this sort of problem so that the amount of computation is kept to a minimum?
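For reference, one fully vectorized half-triangle attempt builds the index pairs with np.triu_indices and batches the row-wise dot products with np.einsum. Note that the fancy indexing copies the selected rows, so this sketch is not guaranteed to beat the plain np.dot either:

```python
import numpy as np

N = 6
arr1 = np.arange(N * 3).reshape((N, 3))
arr2 = arr1.copy()

# Index pairs (i, j) with j >= i: only N*(N+1)/2 inner products are computed.
i, j = np.triu_indices(N)
upper = np.einsum('ij,ij->i', arr1[i], arr2[j])  # batched row-wise dot products

inner_products = np.zeros((N, N))
inner_products[i, j] = upper
inner_products[j, i] = upper  # mirror using the problem's symmetry; the diagonal is overwritten with itself

assert np.array_equal(inner_products, arr1 @ arr2.T)
```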
asked Feb 16 at 0:07 by AJoR

1 Answer
np.dot is clearly not optimal here because the input arrays are integer-typed and also very thin. This is a pathological case. Besides, note that Numpy is also not optimized for arrays having very few items on the contiguous axis (a (3,N) shape should be preferred). Fortunately, we can write faster Numba code for this specific use-case:
import numba as nb
import numpy as np

# Same as np.dot(arr1, arr2.T) for arrays with a shape (N,3)
@nb.njit('(int64[:,::1], int64[:,::1])', parallel=True)
def inner_product(arr1, arr2):
    n = arr1.shape[0]
    assert arr1.shape[1] == 3 and arr2.shape == (n, 3)
    res = np.empty((n, n), dtype=arr1.dtype)
    for i in nb.prange(n):
        for j in range(n):
            s = 0
            if j < i:  # Transposed case
                for k in range(3):
                    s += arr1[j, k] * arr2[i, k]
            else:
                for k in range(3):
                    s += arr1[i, k] * arr2[j, k]
            res[i, j] = s
    return res
Benchmark
Here are results on my i5-9600KF CPU (6 cores) on the input array provided in the question (so with N = 2000):
%timeit -n 10 first_op_code(arr1, arr2)
# 3 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 second_op_code(arr1, arr2)
# 54.5 ms ± 291 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 inner_product(arr1, arr2)
# 5.54 ms ± 362 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 np.dot(arr1, arr2.T)
# 26.1 ms ± 435 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can see that the Numba code is about 10 times faster than second_op_code (and also faster than np.dot, which produces different results in some cases). I suspect the function is rather memory-bound (or even bound by page faults on Windows), since it does not scale well (the code is also faster when run sequentially). That makes sense, since there is not a lot of computation to do compared to the memory accesses. Pre-allocating the output array might help (especially on Windows).
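As a hypothetical illustration of the pre-allocation idea, here it is applied to the plain matmul path rather than the Numba kernel: the (N,N) buffer is allocated once and reused across calls via np.matmul's out= argument, so no fresh pages need to be faulted in on each call.

```python
import numpy as np

N = 2000
arr1 = np.arange(N * 3).reshape((N, 3))
arr2 = arr1.copy()

# Allocate the (N, N) result buffer once...
res = np.empty((N, N), dtype=arr1.dtype)

# ...and reuse it on every call instead of reallocating the output.
np.matmul(arr1, arr2.T, out=res)

assert np.array_equal(res, arr1 @ arr2.T)
```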
Note that enforcing the matrix symmetry by performing res[j, i] = res[i, j] is simple but not efficient (due to the non-contiguous access pattern); it is actually about twice as slow. Transposing the matrix is also rather slow for the same reason (though it can be optimized). Thus, I just recomputed the bottom-left part by transposing the input indices. This is significantly faster than the alternative solutions. Recomputing values is cheap here because the whole operation is actually rather memory-bound.
Note that using smaller data types can also help make the operation faster. You should make sure there is no possible overflow, though.
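For instance (a sketch, using the question's arrays): each inner product sums three element-wise products, so its magnitude is bounded by 3 * max|x|^2, and a quick check of that bound against np.iinfo tells whether a smaller type like int32 is safe before casting:

```python
import numpy as np

N = 2000
arr1 = np.arange(N * 3).reshape((N, 3))

# Each inner product is a sum of 3 products, so its magnitude is bounded by
# 3 * max|x|^2. Check this bound against the smaller type before casting.
bound = 3 * int(np.abs(arr1).max()) ** 2
assert bound < np.iinfo(np.int32).max  # safe for this particular data

arr1_32 = arr1.astype(np.int32)
res32 = arr1_32 @ arr1_32.T  # half the memory traffic of the int64 version

assert np.array_equal(res32, arr1 @ arr1.T)
```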