We are currently transitioning from AMD (Radeon PRO W6800) to NVIDIA graphics cards due to hardware availability, and we are implementing NVIDIA's GPUDirect feature.
We implemented the following GPUDirect workflow. However, we observed that dvpMemcpyLined blocks until the transfer completes, which significantly affects performance.
dvpBegin();
nvrDebug() << "begin" << timer.elapsed();
dvpMapBufferWaitDVP(textureHandle);
nvrDebug() << "wait" << timer.elapsed();
switch (direction) {
case Direction::DownloadFromGPU:
    // GPU texture -> DVP system-memory buffer
    status = dvpMemcpyLined( //
        textureHandle, externalSync->mDvpSync, externalSync->mReleaseValue,
        DVP_TIMEOUT_IGNORED, //
        bufferHandle, gpuSync->mDvpSync, gpuSync->mReleaseValue, //
        0, uint32_t(height));
    break;
case Direction::UploadToGPU:
    // DVP system-memory buffer -> GPU texture
    status = dvpMemcpyLined( //
        bufferHandle, externalSync->mDvpSync, externalSync->mAcquireValue,
        DVP_TIMEOUT_IGNORED, //
        textureHandle, gpuSync->mDvpSync, gpuSync->mReleaseValue, //
        0, uint32_t(height));
    break;
}
nvrDebug() << "copy" << timer.elapsed();
dvpMapBufferEndDVP(textureHandle);
nvrDebug() << "end" << timer.elapsed();
dvpEnd();
nvrDebug() << "the end" << timer.elapsed();
We also tested a naive implementation of texture data copy:
auto frameFormat = frame->format();
void *data = nullptr;
if (!buffer.amd.isNull())
    buffer.amd->bind(); // AMD pinned-memory path: with the buffer bound, data (nullptr) is treated as an offset into it
else
    data = frame->storage(); // otherwise read/write directly from/to the frame's system-memory storage
auto *openGLFunctions =
    QOpenGLVersionFunctionsFactory::get<QOpenGLFunctions_4_5_Core>(context);
QElapsedTimer timer;
timer.start();
switch (direction) {
case Direction::DownloadFromGPU:
    texture->bind();
    openGLFunctions->glGetTexImage(GL_TEXTURE_2D, 0,
                                   openGLPixelFormatFor(frameFormat),
                                   openGLPixelTypeFor(frameFormat), data);
    texture->release();
    break;
case Direction::UploadToGPU:
    texture->setData(pixelFormatFor(frameFormat), pixelTypeFor(frameFormat),
                     data);
    break;
}
nvrDebug() << "copy" << timer.elapsed();
if (!buffer.amd.isNull())
    buffer.amd->release();
Measured transfer times for a single 4K frame:
- AMD Radeon PRO W6800: up to 7 ms with the naive implementation.
- NVIDIA RTX A4000: up to 50 ms, even when using the GPUDirect flow.
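For the download direction, an alternative to the synchronous glGetTexImage call would be an asynchronous readback through a persistently mapped pixel pack buffer plus a fence, so the copy is started in one frame and the data is consumed one or two frames later. A rough sketch, reusing the openGLFunctions, texture, frame and frameFormat objects from the snippet above (bytesPerPixel() is a hypothetical helper, and the GL 4.5 DSA entry points are assumed to be available):

// Setup (once): a persistently mapped pixel pack buffer used as readback target.
GLuint pbo = 0;
const GLsizeiptr size =
    GLsizeiptr(texture->width()) * texture->height() * bytesPerPixel(frameFormat);
openGLFunctions->glCreateBuffers(1, &pbo);
openGLFunctions->glNamedBufferStorage(
    pbo, size, nullptr,
    GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
auto *mapped = static_cast<const unsigned char *>(openGLFunctions->glMapNamedBufferRange(
    pbo, 0, size, GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT));

// Per frame: start the readback, but do not wait for it here.
openGLFunctions->glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
openGLFunctions->glGetTextureImage(texture->textureId(), 0,
                                   openGLPixelFormatFor(frameFormat),
                                   openGLPixelTypeFor(frameFormat),
                                   GLsizei(size), nullptr); // nullptr = offset 0 into the PBO
openGLFunctions->glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
GLsync fence = openGLFunctions->glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// One or two frames later: if the fence has signalled, 'mapped' contains the pixels.
GLenum waitResult =
    openGLFunctions->glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
if (waitResult == GL_ALREADY_SIGNALED || waitResult == GL_CONDITION_SATISFIED) {
    std::memcpy(frame->storage(), mapped, size_t(size));
    openGLFunctions->glDeleteSync(fence);
}

This mainly hides the latency of the transfer rather than increasing PCIe throughput, so it may not address the underlying 50 ms figure.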
Additionally, in the NVIDIA setup, the lspci -vvv output shows that the NVIDIA driver downscales the PCIe link speed (LnkSta reports a lower speed than LnkCap). Forcing the PCIe link speed to its maximum has no effect.
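A cross-check of what the driver itself reports could be done through NVML; a minimal sketch (assuming NVML is available and the A4000 is device index 0; note that the link generation is normally lowered while the GPU is idle, so this is best sampled during a transfer):

#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS)
        return 1;
    nvmlDevice_t device = nullptr;
    // Assumption: the RTX A4000 is device index 0.
    if (nvmlDeviceGetHandleByIndex_v2(0, &device) == NVML_SUCCESS) {
        unsigned int currGen = 0, maxGen = 0, currWidth = 0, maxWidth = 0;
        nvmlDeviceGetCurrPcieLinkGeneration(device, &currGen);
        nvmlDeviceGetMaxPcieLinkGeneration(device, &maxGen);
        nvmlDeviceGetCurrPcieLinkWidth(device, &currWidth);
        nvmlDeviceGetMaxPcieLinkWidth(device, &maxWidth);
        std::printf("PCIe link: gen %u/%u, width x%u/x%u\n",
                    currGen, maxGen, currWidth, maxWidth);
    }
    nvmlShutdown();
    return 0;
}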
- Are there known issues with GPUDirect and PCIe link speed management on NVIDIA RTX A4000 cards?
- Has anyone encountered similar performance discrepancies between AMD’s GL_AMD_pinned_memory and NVIDIA’s GPUDirect workflows?
- Any recommendations for optimizing data transfer in this context, particularly for 4K frame transfers in real-time production environments?
We are exploring other possible routes. We found the GL_ARB_buffer_storage extension and saw that it addresses a similar area to AMD's GL_AMD_pinned_memory extension. Can the pinned address returned by glMapBuffer on such a buffer be used by other hardware, e.g. as a DMA target for the capture card? We are on an EPYC platform without any PCIe switches between the devices.
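For reference, the kind of allocation we have in mind looks like this (a sketch only; frameSizeInBytes is a placeholder, and whether the DeckLink DMA engine can target the returned pointer is exactly the open question):

// Sketch: persistently mapped buffer created via GL_ARB_buffer_storage
// (core since OpenGL 4.4) as a possible replacement for GL_AMD_pinned_memory.
// frameSizeInBytes is a placeholder for the real frame size.
GLuint stagingBuffer = 0;
openGLFunctions->glCreateBuffers(1, &stagingBuffer);
openGLFunctions->glNamedBufferStorage(
    stagingBuffer, frameSizeInBytes, nullptr,
    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT |
        GL_CLIENT_STORAGE_BIT);
// The mapping stays valid until the buffer is deleted. Unlike GL_AMD_pinned_memory,
// where the application allocates the memory and hands it to GL, here GL allocates
// and hands the pointer to the application.
void *mappedPointer = openGLFunctions->glMapNamedBufferRange(
    stagingBuffer, 0, frameSizeInBytes,
    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);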
Our hardware is a Blackmagic DeckLink 8K Pro with an NVIDIA RTX A4000 on an EPYC motherboard. We stream frames from the Blackmagic capture card to system memory and to GPU memory. On the GPU we render frames, which are then transferred to system memory and to the Blackmagic playback card. We need the frames in system memory for recordings and snapshots; the snapshot trigger has to capture frames that occurred up to 3 seconds in the past.
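For scale, assuming 8-bit BGRA at 50 fps: one 4K frame is about 3840 × 2160 × 4 B ≈ 33 MB, so a 3-second history is roughly 150 frames ≈ 5 GB of system memory. A minimal sketch of such a frame history (the Frame type, the capacity and the 50 fps figure are assumptions, not our actual code):

#include <deque>
#include <memory>
#include <mutex>

class Frame; // placeholder for the application's frame type

// Hypothetical frame-history buffer: keeps the last N frames so a snapshot
// trigger can reach up to 3 seconds into the past (e.g. 3 s * 50 fps = 150 frames).
class FrameHistory {
public:
    explicit FrameHistory(size_t capacity) : mCapacity(capacity) {}

    void push(std::shared_ptr<Frame> frame) {
        std::lock_guard<std::mutex> lock(mMutex);
        mFrames.push_back(std::move(frame));
        if (mFrames.size() > mCapacity)
            mFrames.pop_front(); // drop the oldest frame once the window is full
    }

    // framesAgo = 0 returns the most recent frame.
    std::shared_ptr<Frame> frameFromPast(size_t framesAgo) const {
        std::lock_guard<std::mutex> lock(mMutex);
        if (framesAgo >= mFrames.size())
            return nullptr;
        return mFrames[mFrames.size() - 1 - framesAgo];
    }

private:
    const size_t mCapacity;
    mutable std::mutex mMutex;
    std::deque<std::shared_ptr<Frame>> mFrames;
};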
Thank you for any information!