I have been trying out some of the Torchaudio functionalities and I can't figure out why lowpass_biquad is running slower on the GPU than on the CPU. The same is true for other effects like phaser, flanger, and overdrive, which are even slower. I am pasting the example for the lowpass filter, but I apply the other effects the same way to obtain the measurements. The example code is taken from this issue, which seems to have been resolved, but it is still slower on the GPU.
import time
import torch
from torchaudio.functional import lowpass_biquad
gpu_device = torch.device('cuda:0')
cpu_device = torch.device('cpu')
seconds = 1000
sample_rate = 44100
cutoff_freq = 1000.
Q = .7
# Run in cpu
x = torch.rand(sample_rate * seconds, device=cpu_device)
begin = time.time()
y = lowpass_biquad(x, sample_rate, cutoff_freq, Q)
print(f'Run in cpu: {time.time() - begin}')
# Run in gpu
x = torch.rand(sample_rate * seconds, device=gpu_device)
begin = time.time()
y = lowpass_biquad(x, sample_rate, cutoff_freq, Q)
torch.cuda.synchronize()
print(f'Run in gpu: {time.time() - begin}')
Run in cpu: 1.6084413528442383
Run in gpu: 6.183292865753174
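For reference, here is a warmed-up variant of the GPU measurement that excludes the one-time CUDA context setup from the timing (a sketch reusing the variables above):
x = torch.rand(sample_rate * seconds, device=gpu_device)
y = lowpass_biquad(x, sample_rate, cutoff_freq, Q)  # warm-up call, not timed
torch.cuda.synchronize()
begin = time.time()
y = lowpass_biquad(x, sample_rate, cutoff_freq, Q)
torch.cuda.synchronize()
print(f'Run in gpu (warmed up): {time.time() - begin}')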
For the overdrive effect, for example, the GPU is more than 1000x slower. It would be understandable if Torchaudio didn't have GPU implementations of these effects, but the documentation seems to suggest it does. Am I doing something wrong?
1 Answer
A simple way to check whether this code uses the GPU is to profile it. Checking nvidia-smi while it runs is also a simple way to see that.
In a profiler, we can see that 26 CUDA kernels are executed, showing that the program does use the GPU.
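For reference, a minimal profiling sketch (assuming the variables from the question's snippet) that lists the CUDA kernels and where the time goes:
from torch.profiler import profile, ProfilerActivity

x = torch.rand(sample_rate * seconds, device=gpu_device)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = lowpass_biquad(x, sample_rate, cutoff_freq, Q)
    torch.cuda.synchronize()
# Sort kernels by total CUDA time; table columns vary across torch versions.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))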
The profiler also reports that some error happens during execution (a non-zero return code), which is suspicious (the error itself is not reported, so debugging would be needed for more information). I advise you to check the results carefully.
Among all the kernels executed, one takes nearly all the time (>99%): the kernel named iir_cu_kernel. It takes about 7-9 seconds on my Nvidia 1660S GPU, while the CPU implementation takes 0.4 seconds...
One main issue is that this main kernel uses only 1 block! This is ridiculously small. My GPU contains 22 SMs, and each of them needs at least 2 blocks to avoid starving. To hide the latency of the kernel, my GPU generally needs 8-16 blocks per SM (each SM can execute up to 16 blocks concurrently). This is like using a single core of a 128-core CPU! In fact, the profiler reports that less than 1% of the GPU is actually used...
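(The SM count used in this reasoning can be queried from torch directly:)
props = torch.cuda.get_device_properties(0)
# e.g. 22 SMs on the GTX 1660 SUPER used for these measurements
print(props.name, props.multi_processor_count)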
Running the operation multiple times does not make it faster either. The kernel is launched asynchronously, and multiple kernels can in principle run simultaneously, but on my machine the second execution results in a device synchronization, so only 2 kernels can overlap, which is far from enough to use a GPU efficiently...
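Here is a sketch of what such an overlap attempt looks like with CUDA streams (reusing the question's variables); per the observation above, at most a couple of the iir_cu_kernel launches actually overlap, so this barely helps:
streams = [torch.cuda.Stream() for _ in range(4)]
xs = [torch.rand(sample_rate * seconds, device=gpu_device) for _ in streams]
torch.cuda.synchronize()
begin = time.time()
for s, xi in zip(streams, xs):
    with torch.cuda.stream(s):  # launch each filter on its own stream
        lowpass_biquad(xi, sample_rate, cutoff_freq, Q)
torch.cuda.synchronize()
print(f'4 filters with streams: {time.time() - begin}')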
I thought the number of blocks was simply tied to the number of rows of the tensor, but adding more rows did not increase the number of blocks used (though the execution time did not increase either).
I think the main issue is simply that the implementation is currently very inefficient (not GPU friendly), and it can certainly be massively improved. In fact, this is also what this message (from the PR mentioned in the question) tends to indicate (they report a 100x-150x speed-up).
So far, I see 3 options for you: open a new issue explaining this and wait for a faster implementation to be merged (possibly in a few years); implement it yourself (and please contribute it to torch if you can); or find a library that does this better than torchaudio (I don't know which one).
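In the meantime, a pragmatic workaround (a sketch reusing the question's variables) is to fall back to the CPU for these ops and move the result back; given the timings above, the round trip should still beat the current GPU kernel:
# Filter on the CPU, then move the result back to the GPU for the rest
# of the pipeline; the transfer cost is small next to the kernel time.
y = lowpass_biquad(x.cpu(), sample_rate, cutoff_freq, Q).to(gpu_device)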
It would also be good to understand why the number of blocks is 1, but I think that requires a deep analysis of torch's code.
Please note that my version of Torch is the latest on PyPI at the time of writing: version 2.6.0+cu124, released just a week ago!
lowpass_biquad is faster now, but not by a lot: around a 20x improvement. I would still think it should be better. For the overdrive effect, for example, it's almost the same: the GPU is a lot slower, which makes me think there is no GPU implementation? – orglce Commented Feb 6 at 10:19
The lfilter CPU kernel is more optimized than the lfilter GPU kernel. That's likely the underlying reason. – Karl Commented Feb 6 at 22:45
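To illustrate the last comment: lowpass_biquad computes standard biquad coefficients (from the audio EQ cookbook) and dispatches to lfilter, so both CPU and GPU paths end up in the same lfilter kernel. A sketch reusing the question's setup:
import math
from torchaudio.functional import lfilter

# Lowpass biquad coefficients (audio EQ cookbook), matching what
# lowpass_biquad derives internally before calling biquad/lfilter.
w0 = 2 * math.pi * cutoff_freq / sample_rate
alpha = math.sin(w0) / (2 * Q)
b0 = (1 - math.cos(w0)) / 2
b1 = 1 - math.cos(w0)
b2 = b0
a0 = 1 + alpha
a1 = -2 * math.cos(w0)
a2 = 1 - alpha

a = torch.tensor([a0, a1, a2], dtype=x.dtype, device=x.device)
b = torch.tensor([b0, b1, b2], dtype=x.dtype, device=x.device)
y2 = lfilter(x, a, b)  # note: torchaudio's lfilter takes a_coeffs first

# Should match lowpass_biquad's output up to float tolerance.
print(torch.allclose(lowpass_biquad(x, sample_rate, cutoff_freq, Q), y2, atol=1e-4))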