Is there a smart way to vectorize a nested for loop of inner products, where the inner index is lower bound by the outer index?
Here's a simple example. Say that arr1 and arr2 are numpy arrays, each containing N vectors of length 3, i.e. they are of shape (N,3). N is usually in the range between 500 and 2000. I want to compute all possible combinations of inner products, but I also know that the mathematical problem is designed in such a way that the inner product of the i-th vector in arr1 and the j-th vector in arr2 is equal to the inner product of the j-th vector in arr1 and the i-th vector in arr2.
In an explicit nested for-loop this would look something like this:
import numpy as np

N = 2000
# For simplicity, I choose arr1 and arr2 to be the same here, but this need not always be the case.
arr1 = np.arange(N*3).reshape((N, 3))
arr2 = arr1.copy()

inner_products = np.zeros((arr1.shape[0], arr2.shape[0]))
for idx1 in range(arr1.shape[0]):
    vec1 = arr1[idx1]
    for idx2 in range(idx1, arr2.shape[0]):
        vec2 = arr2[idx2]
        inner_products[idx1, idx2] = np.dot(vec1, vec2)
inner_products = inner_products + inner_products.T
inner_products[np.diag_indices(arr1.shape[0])] /= 2
Is there a smart way to vectorize this in a single command, without carrying out twice as many calculations as needed?
So far, I vectorized the inner loop with:
inner_products = np.zeros((arr1.shape[0], arr2.shape[0]))
for idx1 in range(arr1.shape[0]):
    vec1 = arr1[idx1]
    inner_products[idx1, idx1:] = np.dot(vec1.reshape(1, 3), arr2[idx1:].T)
inner_products = inner_products + inner_products.T
inner_products[np.diag_indices(arr1.shape[0])] /= 2
But this variant is still slower than computing all N*N inner products with a single np.dot command, despite that command computing twice as many inner products:
inner_products = np.dot(arr1, arr2.T)
Numba compilation also does not change the outcome; the single np.dot command remains the fastest.
Is there a smart way to vectorize this sort of problem so that the amount of computation is kept to a minimum?
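For reference, one fully vectorized half-triangle attempt builds the index pairs with np.triu_indices and batches the row-wise dot products with np.einsum. Note that the fancy indexing copies the selected rows, so this sketch is not guaranteed to beat the plain np.dot either:

```python
import numpy as np

N = 6
arr1 = np.arange(N * 3).reshape((N, 3))
arr2 = arr1.copy()

# Index pairs (i, j) with j >= i: only N*(N+1)/2 inner products are computed.
i, j = np.triu_indices(N)
upper = np.einsum('ij,ij->i', arr1[i], arr2[j])  # batched row-wise dot products

inner_products = np.zeros((N, N))
inner_products[i, j] = upper
inner_products[j, i] = upper  # mirror using the problem's symmetry; the diagonal is overwritten with itself

assert np.array_equal(inner_products, arr1 @ arr2.T)
```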
asked Feb 16 at 0:07 by AJoR

1 Answer
np.dot is clearly not optimal here because the input arrays are integer-typed and also very thin. This is a pathological case. Besides, note that Numpy is also not optimized for arrays having very few items on the contiguous axis (a (3,N) shape should be preferred). Fortunately, we can write faster Numba code for this specific use-case:
import numba as nb
import numpy as np

# Same as np.dot(arr1, arr2.T) for arrays with a shape (N,3)
@nb.njit('(int64[:,::1], int64[:,::1])', parallel=True)
def inner_product(arr1, arr2):
    n = arr1.shape[0]
    assert arr1.shape[1] == 3 and arr2.shape == (n, 3)
    res = np.empty((n, n), dtype=arr1.dtype)
    for i in nb.prange(n):
        for j in range(n):
            s = 0
            if j < i:  # Transposed case
                for k in range(3):
                    s += arr1[j, k] * arr2[i, k]
            else:
                for k in range(3):
                    s += arr1[i, k] * arr2[j, k]
            res[i, j] = s
    return res
Benchmark
Here are results on my i5-9600KF CPU (6 cores) on the input array provided in the question (so with N = 2000):
%timeit -n 10 first_op_code(arr1, arr2)
# 3 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 second_op_code(arr1, arr2)
# 54.5 ms ± 291 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 inner_product(arr1, arr2)
# 5.54 ms ± 362 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 np.dot(arr1, arr2.T)
# 26.1 ms ± 435 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can see that the Numba code is about 10 times faster than second_op_code (and also faster than np.dot, which produces different results in some cases). I suspect the function is rather memory-bound (or even bound by page faults on Windows), since it does not scale well (the code is also faster when run sequentially). That makes sense, since there is not a lot of computation to do compared to the memory accesses. Pre-allocating the output array might help (especially on Windows).
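As a hypothetical illustration of the pre-allocation idea, here it is applied to the plain matmul path rather than the Numba kernel: the (N,N) buffer is allocated once and reused across calls via np.matmul's out= argument, so no fresh pages need to be faulted in on each call.

```python
import numpy as np

N = 2000
arr1 = np.arange(N * 3).reshape((N, 3))
arr2 = arr1.copy()

# Allocate the (N, N) result buffer once...
res = np.empty((N, N), dtype=arr1.dtype)

# ...and reuse it on every call instead of reallocating the output.
np.matmul(arr1, arr2.T, out=res)

assert np.array_equal(res, arr1 @ arr2.T)
```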
Note that enforcing the matrix symmetry by performing res[j, i] = res[i, j] is simple but not efficient (due to the non-contiguous access pattern); it is actually about twice as slow. Transposing the matrix is also rather slow for the same reason (though it can be optimized). Thus, I just recomputed the bottom-left part by transposing the input indices. This is significantly faster than the alternative solutions. Recomputing values is cheap here because the whole operation is actually rather memory-bound.
Note that using smaller data types can also help make the operation faster. You should make sure there is no possible overflow, though.
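For instance (a sketch, using the question's arrays): each inner product sums three element-wise products, so its magnitude is bounded by 3 * max|x|^2, and a quick check of that bound against np.iinfo tells whether a smaller type like int32 is safe before casting:

```python
import numpy as np

N = 2000
arr1 = np.arange(N * 3).reshape((N, 3))

# Each inner product is a sum of 3 products, so its magnitude is bounded by
# 3 * max|x|^2. Check this bound against the smaller type before casting.
bound = 3 * int(np.abs(arr1).max()) ** 2
assert bound < np.iinfo(np.int32).max  # safe for this particular data

arr1_32 = arr1.astype(np.int32)
res32 = arr1_32 @ arr1_32.T  # half the memory traffic of the int64 version

assert np.array_equal(res32, arr1 @ arr1.T)
```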