I want to implement a particular function in Python involving sparse matrices and want to make it as fast as possible. I describe in detail what the problem is and how I have implemented it so far.
Problem: I have N=1000 fixed (dimension fixed, entries fixed) sparse matrices, collectively called B, each of size 1000x1000 (on average, the fraction of non-zero entries is 0.0001). For a given vector u (of size 1000) I want to compute c[j] = u @ B[j] @ u for each j=0,...,999, and the output should be the numpy array c. So the sparse matrices B[j] (stored in the tuple B) are fixed and u is my function input.
My implementation so far: I precompute all the matrices and treat them as global variables in my program. I decided to save them as scipy.sparse matrices in csr_matrix format (I read that this is the best format when I just want to calculate matrix-vector products), stored in a tuple. To calculate my desired function I do
# precompute matrices, so treat them as global fixed variables
B = []
for j in range(1000):
    sparse_matrix = ...compute it...  # sparse csr_matrix of size 1000x1000
    B.append(sparse_matrix)
B = tuple(B)
def func(u: np.ndarray) -> np.ndarray:
    """
    u is a np.array of length N=1000
    """
    return np.array([u.dot(B[k].dot(u.transpose())) for k in range(len(B))])
Question: Is this the most efficient it can get, or do you see room for improvement regarding speed, e.g. changing how I store the matrices or how I compute all the vector-matrix-vector products? Also, do you see potential for parallelising this computation? If so, do you have a suggestion which libraries/functions I should look into?
Thanks in advance!
Edit: With c[j] = u @ B[j] @ u I mean that I want to compute the matrix-vector product of B[j] and u, and then the inner product of the result with u. (Mathematically: u^T B u.)
In case it is of help to anyone, here is a small benchmark program where I create some random sparse matrices and evaluate the function on a random vector.
import numpy as np
from random import randint
from scipy.sparse import coo_matrix, csr_matrix
from time import time

# Create random tuple of sparse matrices
N = 1000
n = 100
B = []
for i in range(N):
    data = np.random.uniform(-1, 1, n).tolist()
    rows = [randint(0, N-1) for _ in range(n)]
    cols = [randint(0, N-1) for _ in range(n)]
    sparse_matrix = csr_matrix(coo_matrix((data, (rows, cols)), shape=(N, N)))
    B.append(sparse_matrix)
B = tuple(B)

# My function
def func(u: np.ndarray) -> np.ndarray:
    """
    u is a np.array of length N=1000
    """
    return np.array([u.dot(B[k].dot(u.transpose())) for k in range(len(B))])

# random vector to evaluate function
u = np.random.uniform(-1, 1, N)
START = time()
func(u)
END = time()
print(f"Speed : {END - START}")
>>> Speed : 0.005256175994873047
3 Answers
One idea I'd suggest is implementing the matrix multiply in Numba. This gives you two advantages.
- When calculating B[j] @ b, it isn't necessary to allocate an entire vector, since we're about to take a dot product with the a vector. We can keep it as a scalar through the entire calculation.
- Since the number of matrix elements is smaller than the number of rows, a matrix in COO representation is smaller than one in CSR representation. Specifically, in COO this takes (100 + 100 + 100) * 8 bytes, and in CSR this takes (1001 + 100 + 100) * 8 bytes. Numba is flexible enough to let us write a COO algorithm.
Code:
import numpy as np
import numba as nb
from typing import Tuple
from scipy.sparse import csr_matrix

@nb.njit(fastmath=True)
def mult_matrix_coo(a, b, coo_row, coo_col, coo_data):
    assert len(coo_row) == len(coo_col) == len(coo_data)
    sum_ = 0
    for i in range(len(coo_row)):
        row = coo_row[i]
        col = coo_col[i]
        data = coo_data[i]
        sum_ += a[row] * b[col] * data
    return sum_

def func_nb(a: np.ndarray, b: np.ndarray, sparse_matrices: Tuple[csr_matrix]) -> np.ndarray:
    """
    a and b are np.arrays of length N
    sparse_matrices is a precomputed tuple of coo_matrices
    """
    return np.array([mult_matrix_coo(a, b, M.row, M.col, M.data) for M in sparse_matrices])

sparse_matrices_coo = tuple(m.tocoo() for m in sparse_matrices)
func_nb(u, u, sparse_matrices_coo)
Benchmark:
>>> %timeit func(u, u, sparse_matrices)
12.7 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit func_nb(u, u, sparse_matrices_coo)
1.82 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Notes:
- Testing this on a random dataset, it is within 1e-14 of the existing implementation.
- I enabled fastmath=True to allow Numba to unroll the loop in mult_matrix_coo().
- I didn't find that parallelization helped.
It's important to observe that your uBu operation is a quadratic form. There are some optimizations to be had as a result; for instance, you can rewrite each B[j] as an upper-triangular matrix. For highly sparse data, the chance of reducing the number of nonzeros is low, but it can still potentially allow for a more compact sparse representation.
In the dense case, B should simply be an N,N,N tensor (not a tuple of arrays), and your operation should simply be u @ B @ u (no transpose necessary) since Numpy performs broadcasting on matrix products. Of course, scipy.sparse does not support this, so you should still use a single matrix, but with the first two dimensions merged. A demonstration in the dense case that shows the conversion from N,N,N to N**2,N can look like
import numpy as np

def func(u: np.ndarray, B: np.ndarray) -> np.ndarray:
    B = np.triu(B + np.triu(B.transpose((0, 2, 1)), k=1))
    n = u.size
    rhs = B.reshape((n**2, n)) @ u
    rhs_square = rhs.reshape((n, n))
    return rhs_square @ u
In the sparse case, do the triangularisation and reshape of B prior to your iterative ODE; then your function will look like this (with some changed type hints):
def func(u: np.ndarray, Btri: np.ndarray) -> np.ndarray:
    n = u.size
    return (Btri @ u).reshape((n, n)) @ u
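One way to build the stacked sparse Btri up front is with scipy.sparse; the following is a minimal sketch, assuming the question's tuple B of csr_matrix objects (build_Btri is just an illustrative helper name):

import numpy as np
import scipy.sparse as sp

def build_Btri(B):
    """Fold each B[j]'s lower triangle onto its upper triangle (this leaves the
    quadratic form u @ B[j] @ u unchanged) and stack the results vertically."""
    folded = [sp.triu(Bj) + sp.triu(Bj.T, k=1) for Bj in B]
    return sp.vstack(folded, format="csr")  # shape (N*N, N)

Btri = build_Btri(B)
c = func(u, Btri)  # func as defined just above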
If you want to get fancier: if the triangularisation alternates between upper- and lower-triangular representations, the product will still be equivalent, and the N**2,N representation will have at least 50% of its diagonals zero. For example,
rand = np.random.default_rng(seed=0)
N = 4
B = rand.integers(10, size=(N, N, N))
Bt = B.transpose((0, 2, 1))
idx0 = (slice(0, None, 2), ...)
idx1 = (slice(1, None, 2), ...)
B[idx0] = np.tril(
    B[idx0] + np.tril(Bt[idx0], k=-1)
)
B[idx1] = np.triu(
    B[idx1] + np.triu(Bt[idx1], k=1)
)
Btri = B.reshape((N**2, N))
print(Btri)
[[ 8 0 0 0]
[ 9 0 0 0]
[ 6 8 6 0]
[ 7 6 18 7]
[ 6 7 8 16]
[ 0 8 14 7]
[ 0 0 5 8]
[ 0 0 0 1]
[ 0 0 0 0]
[ 8 2 0 0]
[ 4 4 0 0]
[ 5 10 6 6]
[ 2 10 10 11]
[ 0 9 14 15]
[ 0 0 9 13]
[ 0 0 0 3]]
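As a quick sanity check of the equivalence claim, here is a small sketch that continues the snippet above; it assumes a copy B_orig = B.copy() was taken before the in-place folding assignments:

# Assumes B_orig = B.copy() was saved before the folding above.
u = rand.standard_normal(N)
c_direct = np.array([u @ B_orig[j] @ u for j in range(N)])  # original quadratic forms
c_stacked = (Btri @ u).reshape((N, N)) @ u                  # via the stacked matrix
print(np.allclose(c_direct, c_stacked))                     # expected: True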
In addition to the use of Numba, the code can be parallelized. However, parallelizing it efficiently turns out to be very hard with the current input datatype. Indeed:
- Python's multiprocessing is slow here because inter-process communication is far too expensive;
- CPython's multi-threading is limited by the GIL (global interpreter lock);
- All alternative modules used to parallelize code (like Joblib) are limited by the two previous points;
- Numba cannot operate on Python objects (like lists) in parallel loops, so CPython lists (i.e. reflected lists) would have to be converted to typed lists, but this conversion is very expensive... so much so that it is actually slower than the computation itself.
- Each Numpy operation called from CPython typically lasts a few µs and there are 1000 items in the list, so we cannot afford the cost of calling a Numpy function on each item of B.
- Copying all the Numpy arrays into one big array is also too expensive (too much data while the problem is not really compute-bound).
The only fast way to transfer data to Numba is using Numpy arrays. Thus, the only solution I can think of is to extract the pointer of each internal Numpy array of the sparse matrices within B so as to build an array of pointers. In Numba, we can then convert the pointers into C arrays inside a parallel loop. This is tricky since Numba provides no simple way to convert integers to pointers (nor, apparently, true pointer types). Last but not least, I found out that extracting the pointer with array.ctypes.data is surprisingly slow, so it is not really worth it unless we write our own Numba function to extract each pointer... Calling a Numba function from CPython just to extract a pointer is expensive (and makes me sad), but I am not aware of any better solution so far... To reduce the overhead, I use this function to operate on 3 pointers and also to extract the size of each array (not needed if all arrays have the same size, which is not strictly the case in the example). This looks like a fight against CPython's and Numpy's huge overheads (even for basic operations that should be very cheap). All of this makes the resulting Numba parallel implementation artificially complicated and really convoluted.
Here is the resulting Numba code:
import numba as nb
import numpy as np

@nb.njit(['(int64, float64[::1], int32[::1], int32[::1], float64[::1])', '(int64, float64[::1], int64[::1], int64[::1], float64[::1])'], inline='always')
def compute_row(i, x_data, x_cols, x_rows, u):
    s = np.float64(0)
    for j in range(x_rows[i], x_rows[i+1]):
        s += x_data[j] * u[x_cols[j]]
    return u[i] * s

# Matrix-vector product. Only for CSR sparse matrices.
@nb.njit(['(float64[::1], int32[::1], int32[::1], float64[::1])', '(float64[::1], int64[::1], int64[::1], float64[::1])'])
def sparse_compute(x_data, x_cols, x_rows, u):
    total = np.float64(0)
    for i in range(x_rows.size-1):
        s = np.float64(0)
        for j in range(x_rows[i], x_rows[i+1]):
            s += x_data[j] * u[x_cols[j]]
        total += u[i] * s
    return total

# Convert an integer to a pointer
@nb.extending.intrinsic
def address_as_void_pointer(typingctx, src):
    """ returns a void pointer from a given memory address """
    from numba.core import types, cgutils
    sig = types.voidptr(src)
    def codegen(cgctx, builder, sig, args):
        return builder.inttoptr(args[0], cgutils.voidptr_t)
    return sig, codegen

# Wrapping Numba function (since Numba cannot directly operate on sparse-matrix objects)
# Multiple matrix-vector products
@nb.njit(['(uint64[::1],float64[::1],int32[::1])', '(uint64[::1],float64[::1],int64[::1])'], parallel=True)
def sparse_compute_xN(raw_data, u, indexItem):
    n = len(raw_data) // 6
    result = np.empty(n, dtype=np.float64)
    for i in nb.prange(n):
        multi_data = nb.carray(address_as_void_pointer(raw_data[i*6+0]), raw_data[i*6+1], dtype=np.float64)
        multi_cols = nb.carray(address_as_void_pointer(raw_data[i*6+2]), raw_data[i*6+3], dtype=indexItem.dtype)
        multi_rows = nb.carray(address_as_void_pointer(raw_data[i*6+4]), raw_data[i*6+5], dtype=indexItem.dtype)
        result[i] = sparse_compute(multi_data, multi_cols, multi_rows, u)
    return result

@nb.njit(['(float64[::1], int32[::1], int32[::1])', '(float64[::1], int64[::1], int64[::1])'])
def ptr_of(data, indices, indptr):
    return (data.ctypes.data, data.size, indices.ctypes.data, indices.size, indptr.ctypes.data, indptr.size)

def func_fast(u, B):
    assert len(B) > 0
    assert all(b.indices.dtype == b.indptr.dtype for b in B)
    # This part is expensive: it just converts the CPython data-structure to an efficient Numba-friendly one
    raw_data = np.fromiter([e for b in B for e in ptr_of(b.data, b.indices, b.indptr)], dtype=np.uint64)
    return sparse_compute_xN(raw_data, u, B[0].indices)
This code is unfortunately not faster than its equivalent sequential version (without wrapping all the arrays). This is mostly because about half the time is spent wrapping the input data into something that can be read by a Numba function having a parallel loop. On top of that, the Numba parallel loop also does not seem to scale because of allocations (certainly nb.carray). This is due to the current Numba implementation and there is not much to do about it besides not using Numba for that... This is disappointing and mainly shows that we are pushing Numba to its limits; it is not really designed for such a use-case.
Note that the Numba code is rather unsafe, though it appears to return the right results. Indeed, void pointers can be dangerous, and arrays containing different kinds of data can quickly crash the whole interpreter if you make any mistake (in fact, this is often the case with Numba when out-of-bounds accesses happen, though they are easier to debug). The purpose of providing this code is to share and learn how the above approach can be done in practice.
An interesting alternative is to use Cython, which supports this better (note that Cython cannot operate on reflected lists in parallel loops either, due to the GIL, just like Numba, though lists can be converted to fast C arrays more easily than with Numba). Honestly, if you want better performance, I strongly advise not to use Python at all in this part of your code, but a native language instead. Still, if you cannot (or don't want to) fully convert this into a native module, then Cython is the way to go.
By the way, using 16-bit integers for the index arrays of the input sparse matrices, instead of 32-bit (the default type on Windows) or 64-bit ones (the default type on Linux), should be faster. Indeed, it reduces the pressure on memory bandwidth (especially DRAM), and it should make the operation more cache-friendly.
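For illustration only, a minimal sketch of such a downcast (shrink_indices is a hypothetical helper name; the Numba signatures above would also need int16 variants added, and both the matrix dimensions and nnz must fit into int16):

import numpy as np

def shrink_indices(m):
    """Return (data, indices, indptr) of a CSR matrix with index arrays downcast to int16."""
    assert max(m.shape) <= np.iinfo(np.int16).max
    assert m.nnz <= np.iinfo(np.int16).max
    return m.data, m.indices.astype(np.int16), m.indptr.astype(np.int16)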
Comments:
- What is len(sparse_matrices) in practice? Parallelizing the j-based loop should be simple and faster, though probably sub-optimal in Python. What is the data-type of the vectors/matrices (e.g. float64)? – Jérôme Richard
- len(sparse_matrices) is 1000, the same as the dimension of the rows and columns of each individual matrix. So actually, I work with a (very) sparse 3-d tensor in R^{1000x1000x1000}. Yes, the datatype is float64 for both vectors and matrices. However, eventually for the vectors a and b I also want to use complex numbers. – julian2000P
- N = 1000 was an example. I want to increase this number even further in the future (think of 5000). And ultimately this function is called a lot in a numerical ODE solver where I need repeated evaluation of it, so I am trying the best I can to make it as fast as possible. Stacking the matrices on top of each other and forming one big matrix seems like a good idea. But how would I evaluate this function? Let B be the 1,000,000 x 1000 matrix. Then I can call B.dot(b). But how would I multiply this giant matrix from the left with a? – julian2000P
be the 1,000,000 x 1000 matrix. The I can call B.dot(b). But how would I multiply this giant matrix from the left with a? – julian2000P Commented yesterday