I'm trying to debug a new issue that's cropped up in a new distro of our software. I've got a theory for what's going on, but I'd like to run it by people who know how UDP works.
We have a custom protocol we've been using for years that's built on top of UDP. The basic idea is that two networked systems each have a list of buffers they'd like to transmit to the other in real time at a high-ish rate (around 100 Hz).
So what they both do is go through the list ahead of time and divvy it up into datagrams. Each datagram gets buffers assigned to it until no more fit, at which point another datagram is added, and so on through the list. This repeats until it's made datagrams for all the buffers. So at startup time each side figures out how many UDP datagrams it needs to send, and which buffers go into each.
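To make that concrete, here's roughly how I picture the packing step. This is an illustrative sketch, not our actual code; the names and the 1400-byte payload limit are made up:

```c
#include <stddef.h>

/* Illustrative sketch: walk the buffer list in order, opening a new
   datagram whenever the current buffer won't fit in the one being filled.
   MAX_PAYLOAD is an assumed MTU-safe payload limit, not our real value. */
#define MAX_PAYLOAD 1400

size_t plan_datagrams(const size_t *buf_sizes, size_t nbufs, size_t *datagram_of)
{
    size_t ndgrams = 0;
    size_t used = MAX_PAYLOAD;               /* forces the first datagram to open */
    for (size_t i = 0; i < nbufs; i++) {
        if (used + buf_sizes[i] > MAX_PAYLOAD) {
            ndgrams++;                       /* open a fresh datagram */
            used = 0;
        }
        used += buf_sizes[i];
        datagram_of[i] = ndgrams - 1;        /* which datagram buffer i rides in */
    }
    return ndgrams;                          /* total datagrams per tick */
}
```

So with, say, three 600-byte buffers and a 1400-byte limit, you'd get two datagrams: the first carrying buffers 0 and 1, the second carrying buffer 2.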
The dataflow of the protocol is that the sending side wakes up at the 100 Hz tick and, in a tight loop over each datagram buffer, copies the local buffer data into the datagram buffer and then does a socket send() of it.
The receiving side likewise, in a tight loop, does a recv() into its datagram buffer with a 0 timeout (no-wait, I believe) until that recv() call returns with no data. Any data it receives is copied into the appropriate local buffer before the next recv() call happens.
This worked (supposedly) fine for years, although I understand the system typically only required 2 UDP datagrams. However, we recently had a new system built with enough new buffers that it now needs 3 UDP datagrams. The report I got was that the first 2 were coming over fine, but the data in the 3rd datagram wasn't.
I tried this out on our testbench, and it looked like I got it working. However, it definitely isn't working on the actual customer equipment. We can see in Wireshark that all 3 datagrams are getting transmitted, but we can see in our software that it's only ever seeing the first 2. If I'm right that it's working fine on our testbench, this looks an awful lot like either a hardware difference or a race condition (likely the race ending differently due to hardware differences). Supporting this is that I think I've momentarily seen the customer equipment actually receive that 3rd datagram.
So here's my theory for what's going on here. I'd like feedback from UDP gurus on how feasible this sounds:
This user-level protocol of ours on top of UDP is unsynchronized, and that's causing a race between the tight send() loop on the sender's side and the tight no-wait recv() loop on the receiver's side. After the second datagram is send()'d by the sender, the receiver may be recv()-ing it, processing it, and posting its next no-wait recv() before the sender manages to send() its third (and last) datagram. That shows up on the recv() side as the sender being done (because the no-wait recv() reports no data available).
So my theory is that this protocol needs to be tweaked a bit. Either back-and-forth handshaking needs to be added, or the recv() side needs to block with a reasonable timeout, and/or perhaps make use of the fact that it knows (from startup time) how many datagrams it's supposed to be getting.