
rdma - GPUDirect Implementation Challenges with NVIDIA RTX A4000 - Stack Overflow


We are currently transitioning from AMD (Radeon PRO W6800) to NVIDIA graphics cards due to hardware availability, and as part of that we are implementing NVIDIA’s GPUDirect feature.

We implemented the following GPUDirect workflow. However, we observed that dvpMemcpyLined blocks until the copy completes, which significantly affects performance.

dvpBegin();
nvrDebug() << "begin" << timer.elapsed();
dvpMapBufferWaitDVP(textureHandle);
nvrDebug() << "wait" << timer.elapsed();

switch (direction) {
case Direction::DownloadFromGPU:
  status = dvpMemcpyLined( //
      textureHandle, externalSync->mDvpSync, externalSync->mReleaseValue,
      DVP_TIMEOUT_IGNORED,                                     //
      bufferHandle, gpuSync->mDvpSync, gpuSync->mReleaseValue, //
      0, uint32_t(height));
  break;

case Direction::UploadToGPU:
  status = dvpMemcpyLined( //
      bufferHandle, externalSync->mDvpSync, externalSync->mAcquireValue,
      DVP_TIMEOUT_IGNORED,                                      //
      textureHandle, gpuSync->mDvpSync, gpuSync->mReleaseValue, //
      0, uint32_t(height));
  break;
}
nvrDebug() << "copy" << timer.elapsed();
dvpMapBufferEndDVP(textureHandle);
nvrDebug() << "end" << timer.elapsed();
dvpEnd();
nvrDebug() << "the end" << timer.elapsed();

We also tested a naive implementation of texture data copy:

auto frameFormat = frame->format();
void *data = nullptr;
if (!buffer.amd.isNull())
  buffer.amd->bind();
else
  data = frame->storage();

auto *openGLFunctions =
    QOpenGLVersionFunctionsFactory::get<QOpenGLFunctions_4_5_Core>(context);

QElapsedTimer timer;
timer.start();
switch (direction) {
case Direction::DownloadFromGPU:
  texture->bind();
  openGLFunctions->glGetTexImage(GL_TEXTURE_2D, 0,
                                 openGLPixelFormatFor(frameFormat),
                                 openGLPixelTypeFor(frameFormat), data);
  texture->release();
  break;

case Direction::UploadToGPU:
  texture->setData(pixelFormatFor(frameFormat), pixelTypeFor(frameFormat),
                   data);
  break;
}
nvrDebug() << "copy" << timer.elapsed();

if (!buffer.amd.isNull())
  buffer.amd->release();

AMD Radeon PRO W6800: Up to 7ms to transfer a 4K frame using the naive implementation. NVIDIA RTX A4000: Up to 50ms, even when using the GPUDirect flow.
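
To put these timings in perspective, they can be converted into effective throughput. The sketch below is only illustrative: it assumes a 3840x2160 frame at 4 bytes per pixel (for example 8-bit RGBA), which may not match our actual frame format, and the helper names frameBytes and effectiveGBps are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of one uncompressed frame.
// ASSUMPTION: the pixel layout is tightly packed with no row padding.
constexpr std::uint64_t frameBytes(std::uint32_t width, std::uint32_t height,
                                   std::uint32_t bytesPerPixel) {
  return std::uint64_t(width) * height * bytesPerPixel;
}

// Effective throughput in GB/s (1 GB = 1e9 bytes) when `bytes` are
// transferred in `milliseconds`.
constexpr double effectiveGBps(std::uint64_t bytes, double milliseconds) {
  return (double(bytes) / 1e9) / (milliseconds / 1e3);
}
```

Under those assumptions a frame is 33,177,600 bytes, so 7 ms corresponds to roughly 4.7 GB/s and 50 ms to roughly 0.66 GB/s.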

Additionally, in the NVIDIA setup, the lspci -vvv output shows that the NVIDIA driver downgrades the PCIe link speed. Forcing the PCIe link speed to maximum has no effect.
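
One thing that could be checked is whether the link is still downgraded while a transfer is actually running, rather than only when lspci happens to be invoked. A minimal sketch, assuming a Linux system where the PCI device directory in sysfs exposes the current_link_speed and max_link_speed attributes as strings such as "16.0 GT/s PCIe". The function names and the device address in the comment are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <string>

// Parses the leading number of a sysfs link-speed string such as
// "16.0 GT/s PCIe". Returns a negative value if no number is found.
double parseLinkSpeedGTs(const std::string &text) {
  const char *begin = text.c_str();
  char *end = nullptr;
  const double value = std::strtod(begin, &end);
  return end == begin ? -1.0 : value;
}

// Reads one attribute (e.g. "current_link_speed") from a PCI device
// directory such as /sys/bus/pci/devices/0000:41:00.0 (hypothetical address).
// Returns a negative value if the file is missing or unreadable.
double readLinkSpeedGTs(const std::string &deviceDir,
                        const std::string &attribute) {
  std::ifstream file(deviceDir + "/" + attribute);
  std::string line;
  if (!file || !std::getline(file, line))
    return -1.0;
  return parseLinkSpeedGTs(line);
}

// True when both readings are valid and the current speed is below the maximum.
bool linkIsDowngraded(double currentGTs, double maxGTs) {
  return currentGTs > 0.0 && maxGTs > 0.0 && currentGTs < maxGTs;
}
```

Calling readLinkSpeedGTs for both attributes from the transfer thread, right after the copy returns, would show whether the reduced speed persists under load.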

  1. Are there known issues with GPUDirect and PCIe link speed management on NVIDIA RTX A4000 cards?
  2. Has anyone encountered similar performance discrepancies between AMD’s GL_AMD_pinned_memory and NVIDIA’s GPUDirect workflows?
  3. Any recommendations for optimizing data transfer in this context, particularly for 4K frame transfers in real-time production environments?

We are also exploring other possible routes. We found the GL_ARB_buffer_storage extension, which appears to address a similar area as AMD’s GL_AMD_pinned_memory extension. Can a pinned address obtained from glMapBuffer be used by other hardware? We are on an EPYC platform without any PCIe switches between the devices.

Our hardware is a Blackmagic DeckLink 8K Pro with an NVIDIA RTX A4000 on an EPYC motherboard. We stream frames from the Blackmagic capture card to system memory and then to GPU memory. On the GPU we render frames, which are transferred back to system memory and then to the Blackmagic playback card. We need the frames in system memory for recordings and snapshots. A snapshot trigger has to capture frames that occurred up to 3 seconds in the past.
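
To illustrate why the frames have to stay in system memory, the snapshot requirement amounts to a ring buffer of recent frames. The following is a simplified sketch and not our production code: Frame and FrameHistory are hypothetical names, timestamps are in milliseconds, and the capacity would have to be chosen from the frame rate (not stated here) times the 3-second window.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical frame record; a real one would reference pinned pixel memory.
struct Frame {
  std::int64_t timestampMs = 0;
  std::vector<std::uint8_t> pixels;
};

// Fixed-capacity ring buffer that keeps the most recent frames.
class FrameHistory {
public:
  explicit FrameHistory(std::size_t capacity) : slots_(capacity) {
    assert(capacity > 0);
  }

  // Stores a frame, overwriting the oldest one once the buffer is full.
  void push(Frame frame) {
    slots_[head_] = std::move(frame);
    head_ = (head_ + 1) % slots_.size();
    if (count_ < slots_.size())
      ++count_;
  }

  // Copies out, oldest first, every stored frame whose timestamp lies
  // within the last `windowMs` milliseconds before `nowMs`.
  std::vector<Frame> snapshot(std::int64_t nowMs, std::int64_t windowMs) const {
    std::vector<Frame> out;
    const std::size_t oldest = (head_ + slots_.size() - count_) % slots_.size();
    for (std::size_t i = 0; i < count_; ++i) {
      const Frame &frame = slots_[(oldest + i) % slots_.size()];
      if (frame.timestampMs >= nowMs - windowMs)
        out.push_back(frame);
    }
    return out;
  }

  std::size_t size() const { return count_; }

private:
  std::vector<Frame> slots_;
  std::size_t head_ = 0;  // next slot to write
  std::size_t count_ = 0; // number of valid frames stored
};
```

On a trigger, snapshot(now, 3000) would return the retained frames from the last 3 seconds, which is why every frame has to pass through system memory even when nothing is being recorded.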

Thank you for any information!
