
python - Optimized binary matrix inverse - Stack Overflow


I am attempting to invert (mod 2) a non-sparse binary matrix. I want to do this for very large matrices, like 100_000 by 100_000. I have tried libraries such as SymPy, NumPy, and Numba. Most of them work in floats and don't reduce mod 2 at each step; in fact, NumPy reports the determinant of a random 1000 by 1000 invertible binary matrix as inf.

Here is my attempted code, but it already lags at 10_000 by 10_000.

This function is also nice since, even if the matrix is not invertible, I can still read off the rank.

import numpy as np

def binary_inv(A):
    # Gauss-Jordan elimination over GF(2) on the augmented matrix [A | I].
    n = len(A)
    A = np.hstack((A.astype(np.bool_), np.eye(n, dtype=np.bool_)))
    for i in range(n):
        j = np.argmax(A[i:, i]) + i          # first row at or below i with a 1 in column i
        if i != j:
            A[[i, j], i:] = A[[j, i], i:]    # swap the pivot row into place
        A[i, i] = False                      # exclude the pivot row from the update below
        A[A[:, i], i+1:] ^= A[i:i+1, i+1:]   # XOR the pivot row into all rows with a 1 in column i
    return A[:, n:]

Any thoughts on making this as fast as possible?


asked Feb 7 at 0:48 by Bobby Ocean (3,328 reputation, 2 gold, 9 silver, 16 bronze badges); edited Feb 7 at 2:19. 5 comments:
  • @aaa Numpy is nearly pure C code. It is true I might get a factor increase, I think that takes it from days of compute to hours of compute....maybe. The real issue is that I am probably not doing something good enough, like is there a way to divide the problem into smaller pieces. – Bobby Ocean Commented Feb 7 at 2:17
  • Your code seems okay. You can do partitioning and perform block-by-block inverse, should of course figure out the math. Consider GPUs or TPUs, if you have access to good resources. Also look into parallel processing. Also see. – aaa Commented Feb 7 at 2:50
  • @aaa I appreciate your input. I agree speedups can be achieved with things like GPUs, TPUs, pure C code, etc. But my question is more about finding a more efficient algorithm. Is that even possible...to block-split the data and solve for the inverse a portion at a time? I don't know if that can be done. – Bobby Ocean Commented Feb 7 at 3:18
  • See this – aaa Commented Feb 7 at 3:23
  • @aaa Mine is not sparse. Imagine a matrix in which every entry is 50/50 a 0 or 1. You have like a probability of finding a 5 by 5 zero section of like 2e-8, much less anything significant enough to gain savings from sparsity. – Bobby Ocean Commented Feb 7 at 4:02

1 Answer


First of all, the algorithmic complexity of matrix inversion is the same as that of matrix multiplication. Naive matrix multiplication has a complexity of O(n**3). The best practical algorithm known so far is Strassen's, at roughly O(n**2.8) (the asymptotically better ones can be considered galactic algorithms). This complexity improvement comes with a larger hidden constant, so it only pays off for really large matrices, like ones bigger than 10000x10000. On such big matrices, Strassen can plausibly yield a speedup of about 2x to 4x (Strassen on small blocks is not SIMD-friendly). That is not spectacular considering that an efficient Strassen implementation is generally noticeably more complex than an efficient implementation of the naive algorithm (which is actually still used by most BLAS libraries today). In practice, the biggest improvements come from basic (low-level) code optimizations, not algorithmic ones.

One major optimization is to pack the boolean values into bits. NumPy does not do this automatically (it uses 1 byte per boolean value). Packing lets the CPU operate on 8 times as many items per instruction with very little overhead. It also reduces the memory used by the algorithm by the same factor, so the operation becomes more cache-friendly.
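As a rough sketch of the packing idea in pure NumPy: `np.packbits` stores bits MSB-first, so column `i` lives in byte `i // 8` at bit `7 - i % 8`. The function name `binary_inv_packed` is made up for illustration, and unlike the question's code it raises on singular input instead of returning the rank:

```python
import numpy as np

def binary_inv_packed(A):
    """Gauss-Jordan inverse over GF(2) with rows packed 8 entries per byte."""
    n = len(A)
    aug = np.hstack((A.astype(np.bool_), np.eye(n, dtype=np.bool_)))
    P = np.packbits(aug, axis=1)          # uint8 words, MSB-first within each byte
    for i in range(n):
        byte, bit = i // 8, 7 - i % 8
        col = (P[i:, byte] >> bit) & 1    # column i of the not-yet-pivoted rows
        j = int(np.argmax(col)) + i       # first row at or below i with a 1 in column i
        if not col[j - i]:
            raise ValueError("matrix is singular over GF(2)")
        if i != j:
            P[[i, j]] = P[[j, i]]         # swap the pivot row into place
        mask = ((P[:, byte] >> bit) & 1).astype(bool)
        mask[i] = False                   # XOR the pivot row into every other row with a 1 in column i
        P[mask] ^= P[i]
    return np.unpackbits(P, axis=1, count=2 * n)[:, n:]

# Usage: a unit upper-triangular matrix is always invertible over GF(2).
A = np.triu(np.ones((5, 5), dtype=np.uint8))
Ainv = binary_inv_packed(A)
```

Each row XOR now touches 8x fewer bytes than the np.bool_ version; a uint64 view of the packed rows would widen that to 64 columns per word, and in a native language this is exactly where SIMD takes over.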

Another major optimization is to use multiple threads. Modern mainstream consumer CPUs often have 6~8 cores (significantly more on compute servers), so you should expect the computation to be about 3~6 times faster. This is less than the number of cores because Gauss-Jordan elimination does not scale perfectly: it tends to be memory-bound (which is why packing bits is so important).

Yet another optimization consists of computing the matrix block by block so as to make the operation more cache-friendly. The last time I tested this kind of blocking (with relatively small blocks), it roughly doubled the speed. In practice, here, it might help even more since you operate over the field GF(2).
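One way to see that the problem really can be split into pieces (as asked in the comments) is the Schur-complement block inverse, which works unchanged over GF(2) because -1 ≡ 1 (mod 2), so all the usual signs vanish. This is a hypothetical sketch, not the blocking scheme described above: it assumes the leading block stays invertible at each level (a robust version would need pivoting across blocks), and all names are illustrative:

```python
import numpy as np

def gf2_inv(M):
    # Minimal dense Gauss-Jordan inverse over GF(2) (same idea as the question's code).
    n = len(M)
    M = np.hstack((M.astype(np.bool_), np.eye(n, dtype=np.bool_)))
    for i in range(n):
        j = int(np.argmax(M[i:, i])) + i
        if not M[j, i]:
            raise ValueError("singular over GF(2)")
        if i != j:
            M[[i, j]] = M[[j, i]]
        mask = M[:, i].copy()
        mask[i] = False
        M[mask] ^= M[i]
    return M[:, n:]

def gf2_mul(X, Y):
    # Matrix product reduced mod 2.
    return (X.astype(np.int64) @ Y.astype(np.int64)) & 1

def gf2_block_inv(M):
    # Divide-and-conquer inverse via the Schur complement S = D - C A^-1 B,
    # where subtraction is XOR over GF(2). Assumes A is invertible.
    n = len(M)
    if n <= 2:                                  # base case: plain elimination
        return gf2_inv(M)
    h = n // 2
    A, B = M[:h, :h], M[:h, h:]
    C, D = M[h:, :h], M[h:, h:]
    Ai = gf2_block_inv(A)
    S = (D ^ gf2_mul(gf2_mul(C, Ai), B)) & 1    # Schur complement of A
    Si = gf2_block_inv(S)
    CAi = gf2_mul(C, Ai)
    AiB = gf2_mul(Ai, B)
    top_left = (Ai ^ gf2_mul(gf2_mul(AiB, Si), CAi)) & 1
    return np.vstack((np.hstack((top_left, gf2_mul(AiB, Si))),
                      np.hstack((gf2_mul(Si, CAi), Si))))

# Usage: a unit upper-triangular matrix keeps every leading block invertible.
U = np.triu(np.ones((12, 12), dtype=np.uint8))
Ui = gf2_block_inv(U)
```

The pay-off is that all the work happens inside matrix products on half-size blocks, which are exactly the cache-friendly, parallelizable, Strassen-able primitive discussed above.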

Applying such optimizations in NumPy code without introducing substantial overhead is difficult, if possible at all. I strongly advise you to write this in a native language like C, C++, or Rust. Numba can help, but it has known performance issues with packing/unpacking (mainly poor vectorization).

You should be careful to write SIMD-friendly code, because SIMD instructions can operate on many more items than scalar ones (e.g. the AVX2 instruction set operates on 256 bits = 32 x 8-bit items at a time, at a cost similar to a scalar instruction).

If all of this is not enough, you can move the computation onto a GPU and get another significant improvement (say 4~20x depending on the target GPU). Note that you should pack the boolean values into 32-bit integers to get good performance. On top of that, tensor cores might push the limit further if you succeed in leveraging them. Be aware that applying some of the above optimizations on a GPU (especially blocking) is genuinely hard, even for skilled developers. Once optimized, I would expect a very good GPU implementation to take less than 10 minutes on a recent mid/high-end GPU.

All of this combined should realistically make the operation 2 to 3 orders of magnitude faster!
