cuda - N-way bank conflict on GPU shared memory in 64-bit mode and access order across words - Stack Overflow

I have been reading the book "Professional CUDA C Programming", which shows two cases of bank conflicts:

  1. Two-way bank conflict

  2. Three-way bank conflict

The figure below shows how words are mapped to banks, as described in the book (64-bit mode, Kepler device)

I have two questions:

  1. In that figure, what is the exact priority (order) if threads access words 0, 96 and 128 in bank 0 at the same time? I can't understand how words that are in the same bank but separated by exactly 8 bytes (64 bits) are processed.

  2. What exactly is a bank?

    • What is stored in a bank?
    • What does a bank look like?

Asked Mar 2 at 7:40 by kdh
  • Point 1 cannot be observed. The warp continues once all threads have received their data. In a 3-way conflict that takes 3 access cycles. That's all that can be said. Point 2: What is stored in a bank? The memory that is addressed through it. What kind of answer do you expect? If it helps, think of each bank as an independent memory subsystem that can access one word of its memory (typically 4 bytes, in Kepler sometimes 8) per cycle – Homer512 Commented Mar 2 at 10:56
  • Thanks for your comment. Something was wrong with my question... – kdh Commented Mar 2 at 12:06
  • For the record, I wasn't the person down-voting the question and I personally consider it poor form to down-vote without commenting or up-voting a critical comment. However, I think your book might be a bit outdated. Kepler with its 8-byte banks is more than 10 years old. I don't think you should concern yourself with that stuff anymore. Banks in general are of course still around, but that whole diagram would be a lot less confusing without Kepler-specific details – Homer512 Commented Mar 2 at 12:16
  • Wow! Thanks for your consideration. I know that book is old, but I think its explanation was good for me. Then how about the book "Programming massively parallel processors 4th (2022)"? – kdh Commented Mar 2 at 12:31
  • The Kepler details also confused me for a while. On newer GPUs there is no 64 bit mode. Banks are always 32 bit... – paleonix Commented Mar 2 at 17:13

1 Answer

The book you are reading describes Kepler, a micro-architecture that is more than ten years old and does not resemble current GPU hardware.

In the very next generation, Maxwell (compute capability 5.x), the way shared memory works was changed, and it has worked the same way in all subsequent micro-architectures:

  1. A bank is 32 bits wide.
  2. There are 32 banks.
  3. If two threads in a warp access the same bank but different addresses, you have a bank conflict.
  4. If you have a bank conflict, that shared memory access is serialized and cannot execute in parallel.

See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x
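
The four rules above can be sketched as a small host-side model. This is only an illustration in Python, not something taken from the NVIDIA documentation: `bank_of` and `conflict_degree` are hypothetical helper names, and the model assumes byte addresses, 32 banks and 4-byte bank words as listed above.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumed: 32 banks (compute capability 5.x and later)
BANK_WIDTH = 4   # assumed: each bank word is 4 bytes (32 bits)

def bank_of(byte_address):
    """Bank that holds the 32-bit word containing this byte address."""
    return (byte_address // BANK_WIDTH) % NUM_BANKS

def conflict_degree(byte_addresses):
    """Largest number of distinct words any single bank must serve for one warp."""
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)

# Stride of one word: every thread hits its own bank.
print(conflict_degree([4 * t for t in range(32)]))   # 1
# Stride of two words: threads t and t+16 share a bank.
print(conflict_degree([8 * t for t in range(32)]))   # 2
# Every thread reads the same word: broadcast.
print(conflict_degree([0] * 32))                     # 1
```

A degree of 1 means the warp is served in one pass; a degree of n corresponds to an n-way bank conflict.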

Let me draw that for you. Threads are labeled A, B, ... and each column is a 32-bit word of shared memory, in the normal address order. Note that the column numbers are hex (20 hex = 32 dec), so with 32 banks the words 0, 20 and 40 all fall into bank 0.

Threads: A, B, ...

  0   1   2   3   4  ...  20  21  22  23  24  25 ... 40  41  
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| A | B | C | D |   |   | E | F | G |   | H | I |   | J | K |

Here threads A, B, C, D, H, I can fetch in parallel on cycle 1.
Threads E, F, G use the same banks as A, B, C but different addresses, so they have a bank conflict and have to wait. On cycle 2 threads E, F, G can fetch, but threads J, K have a bank conflict with E, F, so they have to wait.
Next cycle J, K can fetch. Because there are 2 waits, this is a 3-way bank conflict.
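
The cycle-by-cycle walk-through can be replayed with a toy scheduler. This is a sketch under stated assumptions, not a description of the real hardware, whose service order cannot be observed: `schedule` is a hypothetical helper, the model has 32 banks of 32 bits, and the word indices are picked so that 0x00, 0x20 and 0x40 share a bank.

```python
NUM_BANKS = 32  # assumed: 32 banks, each 32 bits wide

def schedule(accesses):
    """Split one warp's accesses into conflict-free passes.

    accesses maps a thread label to the word index it reads.
    In each pass every bank serves a single word; all threads
    that want that same word are served together (broadcast).
    """
    pending = dict(accesses)
    passes = []
    while pending:
        chosen = {}  # bank -> word served in this pass
        for word in pending.values():
            chosen.setdefault(word % NUM_BANKS, word)
        served = sorted(t for t, w in pending.items()
                        if chosen[w % NUM_BANKS] == w)
        passes.append(served)
        for t in served:
            del pending[t]
    return passes

# Illustrative word indices (hex), one entry per thread.
warp = {"A": 0x00, "B": 0x01, "C": 0x02, "D": 0x03,
        "E": 0x20, "F": 0x21, "G": 0x22,
        "H": 0x24, "I": 0x25,
        "J": 0x40, "K": 0x41}

for cycle, threads in enumerate(schedule(warp), start=1):
    print(cycle, threads)
# 1 ['A', 'B', 'C', 'D', 'H', 'I']
# 2 ['E', 'F', 'G']
# 3 ['J', 'K']
```

If E, F, G and J, K are pointed at the same words as A, B, C instead, the same function returns a single pass, because identical addresses are served as a broadcast.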

Another example

Threads: A, B, ...

  0   1   2   3   4  ...  20  21  22  23  24  25 ... 40  41  
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| A | B | C | D |   |   |   |   |   |   | H | I |   |   |   |
| E | F | G |
| J | K |

Here E, F, G and J, K access the same banks and the same addresses as A, B, C, so there is no bank conflict. An access to the same address is a broadcast and does not incur a wait.

All threads can fetch data in the same cycle.

Note that Maxwell and beyond have 32-bit banks and 32-bit registers. If you load 64 bits, then the GPU will load the first 32 bits in cycle 1 and the next 32 bits in cycle 2.

So, in pseudo-assembly:

LDS.u64 R0, [0x10]

This first loads shared memory address 0x10 into R0, and after that address 0x11 into R1 (64-bit loads always start on an even register).
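
To see why a 64-bit load cannot fit in a single pass for a whole warp, it helps to count words per bank. The sketch below is a host-side illustration only: `banks_for_u64` and `min_passes` are hypothetical names, addresses are treated as 32-bit word indices as in the pseudo-assembly, and the model says nothing about the order in which the hardware issues the two halves.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 banks of 32 bits each

def banks_for_u64(word_index):
    """Banks touched by a 64-bit load that starts at this 32-bit word index."""
    return (word_index % NUM_BANKS, (word_index + 1) % NUM_BANKS)

def min_passes(word_indices):
    """Lower bound on passes: the most distinct words any one bank must serve."""
    per_bank = Counter(w % NUM_BANKS for w in set(word_indices))
    return max(per_bank.values(), default=0)

# The load from the example: word 0x10 goes to R0, word 0x11 to R1.
print(banks_for_u64(0x10))      # (16, 17)

# A full warp where thread t loads the 64-bit value at words 2t and 2t+1.
warp_words = [w for t in range(32) for w in (2 * t, 2 * t + 1)]
print(min_passes(warp_words))   # 2
```

Thirty-two threads times two words is 64 words, so every one of the 32 banks has to deliver two different words, which matches the two cycles described above.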

Note 1: yes I know that memory loads from shared memory take more than 1 cycle per load (more like 4), but let's assume a load takes a single cycle for argument's sake.

Note 2: interestingly, CC 4 does not exist. I guess it was canceled and 5 was a redesign.
