I have a simple function that performs a sliding dot product using an overlap-add convolution approach:
```python
import numpy as np
from scipy.signal import oaconvolve
import pyfftw
import os


def scipy_sliding_dot(A, B):
    m = A.shape[0]
    n = B.shape[0]
    Ar = np.flipud(A)  # Reverse/flip A
    AB = oaconvolve(Ar, B)
    return AB.real[m - 1 : n]
```
For reference, this is the same thing as doing:
```python
def naive_sliding_dot(A, B):
    m = len(A)
    n = len(B)
    l = n - m + 1
    out = np.empty(l)
    for i in range(l):
        out[i] = np.dot(A, B[i : i + m])
    return out
```
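For what it's worth, here's a quick consistency check (my own, with seeded toy inputs and the definitions repeated so it runs standalone) confirming that the two implementations agree:

```python
import numpy as np
from scipy.signal import oaconvolve


def scipy_sliding_dot(A, B):
    m = A.shape[0]
    Ar = np.flipud(A)       # reversing A turns convolution into correlation
    AB = oaconvolve(Ar, B)  # full linear convolution, length m + n - 1
    return AB.real[m - 1 : B.shape[0]]  # keep the n - m + 1 "valid" windows


def naive_sliding_dot(A, B):
    m = len(A)
    out = np.empty(len(B) - m + 1)
    for i in range(len(out)):
        out[i] = np.dot(A, B[i : i + m])
    return out


rng = np.random.default_rng(0)
A = rng.random(2**6)
B = rng.random(2**12)  # smaller than 2**20 so the Python loop stays fast
fast = scipy_sliding_dot(A, B)
slow = naive_sliding_dot(A, B)
print(np.allclose(fast, slow))  # True
```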
When I initialize two random (always-real, never complex) arrays:
```python
A = np.random.rand(2**6)
B = np.random.rand(2**20)
```
and then time `scipy_sliding_dot` with:

```python
%timeit scipy_sliding_dot(A, B)
```

I get:

```
6.39 ms ± 38.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
I then attempt to speed this up with multi-threaded `pyfftw`:
```python
class pyfftw_sliding_dot(object):
    # Based on
    def __init__(self, A, B, threads=1):
        shape = (np.array(A.shape) + np.array(B.shape)) - 1
        self.rfft_A_obj = pyfftw.builders.rfft(A, n=shape, threads=threads)
        self.rfft_B_obj = pyfftw.builders.rfft(B, n=shape, threads=threads)
        self.irfft_obj = pyfftw.builders.irfft(self.rfft_A_obj.output_array, n=shape, threads=threads)

    def __call__(self, A, B):
        m = A.shape[0]
        n = B.shape[0]
        Ar = np.flipud(A)  # Reverse/flip A
        rfft_padded_A = self.rfft_A_obj(Ar)
        rfft_padded_B = self.rfft_B_obj(B)
        return self.irfft_obj(np.multiply(rfft_padded_A, rfft_padded_B)).real[m - 1 : n]
```
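In case it matters, the rfft/multiply/irfft scheme itself appears to be correct; here is the same computation written with plain `numpy.fft` (my own cross-check, so it runs even without `pyfftw`), which matches the direct loop:

```python
import numpy as np


def numpy_rfft_sliding_dot(A, B):
    m, n = A.shape[0], B.shape[0]
    size = m + n - 1                        # full linear-convolution length
    fA = np.fft.rfft(np.flipud(A), n=size)  # zero-padded real FFT of reversed A
    fB = np.fft.rfft(B, n=size)
    return np.fft.irfft(fA * fB, n=size)[m - 1 : n]


rng = np.random.default_rng(0)
A = rng.random(2**6)
B = rng.random(2**12)
expected = np.array([np.dot(A, B[i : i + A.shape[0]])
                     for i in range(B.shape[0] - A.shape[0] + 1)])
result = numpy_rfft_sliding_dot(A, B)
print(np.allclose(result, expected))  # True
```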
Then, I test the performance with:
```python
n_threads = os.cpu_count()
obj = pyfftw_sliding_dot(A, B, n_threads)
%timeit obj(A, B)
```
and get:

```
33 ms ± 347 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
which means that multi-threaded `pyfftw` is ~5x slower than `scipy`. I've pored through the builders documentation and played around with all of the "additional arguments" (e.g., `planner_effort`, `overwrite_input`, etc.) but the `pyfftw` performance does not change.

What am I doing wrong with `pyfftw`, and how can I make it faster than `scipy`?