c++23 - Should `alignof` be used when serializing data to a buffer in C++? - Stack Overflow

Recently I learned that when reading and writing data from a memory address, the data must be aligned correctly to avoid potential issues with undefined behaviour.

On some platforms (for example x86) rather than undefined behaviour, there is a performance penalty. This is caused by the compiler producing code which contains multiple loads in place of what would have been a single load for correctly aligned data.

Further, I understand that in many cases, the required alignment is the same width as the datatype. However, this is not a requirement or a rule and there may be exceptions to this.

My understanding is that the alignof operator can be used to correctly align data. This is similar to how sizeof can be used to correctly allocate memory size for data.

I want to write some data serialization and deserialization code to read and write data into a buffer. This data will be sent via a network socket between multiple machines. In this case, it would be reasonable to assume that the machines will all have the same endianness to avoid the overhead of needing to discuss converting between host and network byte order.

The question is: how should this be done?

For example, if I wanted to send an arbitrary sequence of data, how should I use alignof, or should I use it at all, to ensure that the code is not at risk of undefined behaviour?

To provide a concrete example, I might want to serialize a uint64_t, followed by an int8_t, followed by a 32-bit float.

The naive way to do it is to write the 8 bytes of the uint64_t, followed by the single byte for the int8_t, followed by the 4 bytes for the float.

However, I think that while the first two elements will be correctly aligned, the final float will certainly not be.
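To make the offsets concrete, here is a minimal sketch that computes where each field would start in the back-to-back layout and checks that position against alignof. The names kOffsetU64, kOffsetI8, kOffsetF32, kPackedSize and is_aligned_offset are made up for illustration. The offsets are relative to the start of the buffer, so the check only means something if the buffer itself starts at a suitably aligned address.

```cpp
#include <cstddef>
#include <cstdint>

// Byte offsets of each field when the values are written back to back.
// (Illustrative names; not part of any library.)
constexpr std::size_t kOffsetU64 = 0;
constexpr std::size_t kOffsetI8 = kOffsetU64 + sizeof(std::uint64_t);
constexpr std::size_t kOffsetF32 = kOffsetI8 + sizeof(std::int8_t);
constexpr std::size_t kPackedSize = kOffsetF32 + sizeof(float);

// True when `offset` is a multiple of T's alignment requirement.
// alignof(T) is implementation-defined for most types, so the result
// can differ between platforms.
template <typename T>
constexpr bool is_aligned_offset(std::size_t offset) {
    return offset % alignof(T) == 0;
}
```

On a platform where alignof(float) is greater than 1 (4 is common), is_aligned_offset<float>(kOffsetF32) is false, which matches the concern above: the float would start at byte 9.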

Asked Feb 7 at 17:55 by user2138149
  • 3 You should use a proper binary serialization library. E.g. flatbuffers or protobuf. There are many more pitfalls when interpreting raw binary data than you might think (endianness, type casting not starting lifetimes of objects etc... etc..). – Pepijn Kramer Commented Feb 7 at 18:00
  • 3 Use a serialization library ( like Protobuf ). Usually with these libraries you define the serialization data in a meta language and tools included with the library produce language specific data structures. – Richard Critten Commented Feb 7 at 18:01
  • 1 In other words, accept the "performance" penalty first, because code with UB is far worse. I would only "worry" about "optimization" if you have a performance issue you can prove with a profiler. – Pepijn Kramer Commented Feb 7 at 18:04
  • Writing into the buffer shouldn't give a crap about alignment. Unless you're doing something really weird, a buffer will just be an array of bytes. Reading the stuff back out of the buffer at the other side, you are usually copying bytes out of the buffer and into a variable of the correct type, which is already correctly aligned. If your serialization protocol is the source data structure then you have to watch your butt. But you also aren't really serializing. – user4581301 Commented Feb 7 at 18:05
  • Hard to answer in general, and depends on the field. An extremely common problem in embedded systems programming. Most embedded systems codebases I have worked with do this by defining their data structures as packed, and then being very conscious about manually adding padding bytes where appropriate. They do not usually leave such things up to the compiler, because compilers in this field are highly heterogeneous. Endianness typically remains an annoying remnant, often solved with macros when accessing the data. They also tend to avoid bitfields for similar reasons. Add offsets in comments. – dialer Commented Feb 7 at 18:33
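The explicit-padding approach described in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: WireMessage, to_bytes and from_bytes are hypothetical names, float is assumed to be 4 bytes, and the sketch uses static_assert checks on the layout instead of a compiler-specific packed attribute.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical wire layout for the question's three fields. Padding is
// spelled out by hand so every member sits at a naturally aligned offset.
struct WireMessage {
    std::uint64_t id;           // offset 0
    std::int8_t   level;        // offset 8
    std::uint8_t  reserved[3];  // offset 9, explicit padding
    float         value;        // offset 12
};

// Fail the build if the compiler lays the struct out differently.
static_assert(std::is_trivially_copyable<WireMessage>::value, "must be memcpy-safe");
static_assert(std::is_standard_layout<WireMessage>::value, "offsetof needs standard layout");
static_assert(offsetof(WireMessage, id) == 0, "unexpected offset");
static_assert(offsetof(WireMessage, level) == 8, "unexpected offset");
static_assert(offsetof(WireMessage, reserved) == 9, "unexpected offset");
static_assert(offsetof(WireMessage, value) == 12, "unexpected offset");
static_assert(sizeof(WireMessage) == 16, "hidden padding present");

// Copy the object representation into a plain byte array.
inline void to_bytes(const WireMessage& msg, unsigned char (&out)[sizeof(WireMessage)]) {
    std::memcpy(out, &msg, sizeof(WireMessage));
}

// Copy bytes back into a properly aligned object; the byte array itself
// may sit at any address.
inline WireMessage from_bytes(const unsigned char (&in)[sizeof(WireMessage)]) {
    WireMessage msg;
    std::memcpy(&msg, in, sizeof(WireMessage));
    return msg;
}
```

The static_assert lines play the role of the "offsets in comments" habit mentioned above, except that a mismatch stops compilation instead of relying on a reader to notice it. Endianness is not handled here.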

1 Answer

The data must be aligned correctly to avoid potential issues with undefined behaviour.

That is not true, generally. You should be able to write safe programs, free of undefined behaviour, without dealing with alignment at all. If you have a specific case against this idea, or a specific compiler or architecture where it does not hold, please post it.
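As a rough illustration of that point, the question's example (a uint64_t, an int8_t, then a float) can be written to and read from a byte buffer with std::memcpy, which never accesses the buffer as anything other than bytes. This is a minimal sketch, not a full serialization scheme: append_bytes and read_bytes are hypothetical helper names, and both sides are assumed to share endianness and type sizes, as stated in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical helper: append the object representation of `value` to the
// buffer. The destination is only touched as bytes, so the alignment of the
// write position does not matter.
template <typename T>
void append_bytes(std::vector<unsigned char>& buffer, const T& value) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    const std::size_t old_size = buffer.size();
    buffer.resize(old_size + sizeof(T));
    std::memcpy(buffer.data() + old_size, &value, sizeof(T));
}

// Hypothetical helper: copy sizeof(T) bytes starting at `offset` into a local
// object (which the compiler aligns correctly) and advance `offset`.
template <typename T>
T read_bytes(const std::vector<unsigned char>& buffer, std::size_t& offset) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    if (offset > buffer.size() || buffer.size() - offset < sizeof(T)) {
        throw std::out_of_range("buffer too short");
    }
    T value;
    std::memcpy(&value, buffer.data() + offset, sizeof(T));
    offset += sizeof(T);
    return value;
}
```

With these helpers, the float is stored at byte offset 9 and read back from there without any misaligned float access, because the only float object involved is the local variable inside read_bytes.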

On some platforms (for example x86) rather than undefined behaviour, there is a performance penalty.

Yes, some data types load into registers faster when they are properly aligned.

I understand that in many cases, the required alignment is the same width as the datatype

Yes, and the reason is the same as above. If it helps, you can imagine that registers are themselves "aligned", so moving several bytes into a register is faster when the alignments match.

I want to write some data serialization and deserialization code to read and write data into a buffer. This data will be sent via a network socket between multiple machines. In this case, it would be reasonable to assume that the machines will all have the same endianness to avoid the overhead of needing to discuss converting between host and network byte order.

There are entire books dedicated to binary serialization formats, and yes, alignment, endianness, and precision are key factors in them. My actual answer, if you really do need to send data over the network, is to stick with an established cross-language binary protocol.

Examples:

  1. protobuf https://protobuf.dev/
  2. thrift https://thrift.apache.org/
  3. Binary JSON https://bsonspec.org/
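If a hand-rolled format is used anyway, the question's assumption that all machines share the same endianness can at least be checked at run time rather than taken on trust. The following is a minimal sketch of one way to do that: the tag values and the names host_is_little_endian, host_endianness_tag and sender_matches_host are invented for illustration, and mixed-endian hosts are not considered.

```cpp
#include <cstdint>
#include <cstring>

// Returns true when the least significant byte of an integer is stored first.
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

// Hypothetical one-byte tags a sender could place at the start of a message.
constexpr unsigned char kLittleEndianTag = 0x01;
constexpr unsigned char kBigEndianTag = 0x02;

inline unsigned char host_endianness_tag() {
    return host_is_little_endian() ? kLittleEndianTag : kBigEndianTag;
}

// A receiver can reject a message whose tag differs from its own byte order
// instead of silently misreading multi-byte fields.
inline bool sender_matches_host(unsigned char received_tag) {
    return received_tag == host_endianness_tag();
}
```

The established protocols listed above define their own byte order, so this kind of check only matters for a custom format.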