c++ - Avoiding UB while reading binary data from std::ifstream

I'm reviewing classic code that (de)serializes objects from a file, and I'm wondering whether it has undefined behavior (UB).

I'm making two simplifying assumptions:

In the next snippet, the file is assumed to be well formed (a std::uint32_t, a std::int64_t and a float, with the right endianness; the float uses the same representation as in the program).
I'm reading only implicit-lifetime types that are trivially copyable and trivially destructible.

struct Content
{
    std::uint32_t first;
    std::int64_t second;
    float last;
};

std::string Path(<Some valid path>);    // path to a binary file holding only data of implicit lifetime type
std::ifstream is(Path, std::ios::binary);
Content content;
is.read(reinterpret_cast<char *>(&content.first), sizeof(content.first));
is.read(reinterpret_cast<char *>(&content.second), sizeof(content.second));
is.read(reinterpret_cast<char *>(&content.last), sizeof(content.last));

I know that this kind of code has been used without issues for ages, but is the reinterpret_cast legal in this case, and why?

Or should we go for:

char buffer[sizeof(content.first)];
is.read(buffer, sizeof(content.first));
std::memcpy(&content.first, buffer, sizeof(content.first));
...

or:

char buffer[sizeof(content.first)];   // std::bit_cast requires source and destination of equal size
is.read(buffer, sizeof(content.first));
content.first = std::bit_cast<std::uint32_t>(buffer);
...

asked Mar 17 at 16:53 by Oersted

If you run this code on a platform with a different endianness, it will fail. – Marek R, Mar 17 at 17:00
@marek-r I specified in the question that I assume the endianness to be correct; the question is not about serialization in general. – Oersted, Mar 17 at 17:11
The read into the float is a recipe for disaster. I've seen code with this that works "sometimes" and other times not, even if read returns 4 as expected. Just don't do that. Read into a buffer and bit_cast or memcpy from that. – Ted Lyngmo, Mar 17 at 17:59
@pepijn-kramer Out of curiosity, do you know of resources that explain how serialization libraries are implemented to avoid all the pitfalls? BTW, my question was restricted to "friendly" types (at least for the integral ones). Regarding lifetime, the destination objects are alive because they are explicitly constructed; read only has to copy the object representation. – Oersted, Mar 17 at 18:03
I don't have any resources for details, but usually there is some kind of high-level data description file (not in C++, e.g. proto files for protobuf). Internally they then have very strict rules, like an int takes precisely 4 bytes and is little-endian, a string starts with a size followed by that number of bytes, and so on. So they build up a (binary) format that is independent of the actual memory layout C++ uses, which can then be written to disk, the network or a string buffer. Deserialization is done into a data object (not an original C++ class). So it is all about data, not object instances. – Pepijn Kramer, Mar 17 at 18:11


1 Answer


In the end, I found that under the reinterpret_cast conversion rules ([expr.reinterpret.cast]), the cast itself is valid.

Then, under the type-aliasing rules ([basic.lval]), I can access the object representation of the data members through a glvalue of type char. That access happens inside std::ifstream::read, through its first parameter, the char * produced by the reinterpret_cast.

So the behavior is well defined if and only if the bytes written by read form a valid object representation for the destination object.
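For comparison, here is a minimal sketch of the buffer-based alternative raised in the question and comments: read the bytes into a char array, check that the read succeeded, and only then std::memcpy them into the destination. The names read_value, read_content and round_trip_demo are hypothetical, and a std::stringstream stands in for the file so that the example is self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>      // std::memcpy
#include <istream>
#include <optional>
#include <sstream>
#include <type_traits>

struct Content
{
    std::uint32_t first;
    std::int64_t second;
    float last;
};

// Hypothetical helper: reads sizeof(T) bytes into a char buffer and copies
// them into a T. Returns std::nullopt on a short read or stream error.
template <typename T>
std::optional<T> read_value(std::istream &is)
{
    static_assert(std::is_trivially_copyable_v<T>,
                  "only trivially copyable types may be filled byte by byte");
    char buffer[sizeof(T)];
    if (!is.read(buffer, sizeof(T)))
        return std::nullopt;
    T value;
    std::memcpy(&value, buffer, sizeof(T));
    return value;
}

// Hypothetical helper: reads the three members in file order.
std::optional<Content> read_content(std::istream &is)
{
    const auto first = read_value<std::uint32_t>(is);
    const auto second = read_value<std::int64_t>(is);
    const auto last = read_value<float>(is);
    if (!first || !second || !last)
        return std::nullopt;
    return Content{*first, *second, *last};
}

// Writes the members with the host's own representation, then reads them back.
void round_trip_demo()
{
    const Content expected{42u, -7, 1.5f};
    std::stringstream ss(std::ios::in | std::ios::out | std::ios::binary);
    ss.write(reinterpret_cast<const char *>(&expected.first), sizeof expected.first);
    ss.write(reinterpret_cast<const char *>(&expected.second), sizeof expected.second);
    ss.write(reinterpret_cast<const char *>(&expected.last), sizeof expected.last);

    const std::optional<Content> got = read_content(ss);
    assert(got.has_value());
    assert(got->first == expected.first);
    assert(got->second == expected.second);
    assert(got->last == expected.last);
}
```

Checking the stream state before copying means the destination is never filled from a partially read buffer; whether the extra copy matters in practice is a separate question from the one asked here.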

Even then, because of endianness, floating-point format and similar issues, the object representation may be valid while the resulting value is not the expected one.
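One way to make those assumptions explicit is to check them before reading anything. The sketch below assumes, purely for illustration, a file format that is little-endian with an IEEE 754 binary32 float; the question only states that the file matches the program, so both the format and the function names are hypothetical.

```cpp
#include <cassert>
#include <climits>   // CHAR_BIT
#include <cstdint>
#include <cstring>   // std::memcpy
#include <limits>

// Inspects the object representation of a known value to detect byte order.
bool host_is_little_endian()
{
    const std::uint32_t probe = 1;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 1;
}

// True when the host matches the format assumed for this sketch:
// 8-bit bytes, little-endian integers, 4-byte IEEE 754 float.
bool host_matches_assumed_format()
{
    return CHAR_BIT == 8
        && host_is_little_endian()
        && std::numeric_limits<float>::is_iec559
        && sizeof(float) == 4;
}

// Debug-build guard, intended to be called once before the first read.
void require_assumed_format()
{
    assert(host_matches_assumed_format());
}
```

In C++20, std::endian::native from <bit> gives the byte order at compile time, which would let the endianness part of this check become a static_assert.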
