I am writing a raw protobuf message with the library com.google.protobuf, leveraging UnknownFieldSet, and I am encountering a problem when encoding strings: they sometimes break the result.
I want to encode:
1 -> ["stuff", "stuff"]
2 -> ["stuff","android.microphone","stuff"]
which I figured can be done using the following code:
import com.google.protobuf.{ByteString, UnknownFieldSet}
// ....
def doEncoding(): UnknownFieldSet = {
  UnknownFieldSet.newBuilder()
    // field 1 -> ["stuff", "stuff"]
    .addField(1, UnknownFieldSet.Field.newBuilder()
      .addLengthDelimited(ByteString.copyFromUtf8("stuff"))
      .addLengthDelimited(ByteString.copyFromUtf8("stuff"))
      .build())
    // field 2 -> ["stuff", "android.microphone", "stuff"]
    .addField(2, UnknownFieldSet.Field.newBuilder()
      .addLengthDelimited(ByteString.copyFromUtf8("stuff"))
      .addLengthDelimited(ByteString.copyFromUtf8("android.microphone"))
      .addLengthDelimited(ByteString.copyFromUtf8("stuff"))
      .build())
    .build()
}
However, dumping the resulting bytes into a file using .toByteArray on the UnknownFieldSet and then reading the data using protod results in an unexpected data structure:
[0a] 1 string: (5) stuff (73 74 75 66 66)
[0a] 1 string: (5) stuff (73 74 75 66 66)
[12] 2 string: (5) stuff (73 74 75 66 66)
[12] 2 string: (18) android.microphone
[61] 12 fixed64/double: 7867336003066946670 (0x6d2e64696f72646e) (8.381649661287266e+217)
[69] 13 fixed64/double: 7308901739622527587 (0x656e6f68706f7263) (3.9466026192472086e+180)
[12] 2 string: (5) stuff (73 74 75 66 66)
The first array is fine, but the second is broken and contains values I never entered.
What am I doing wrong when adding the string to the raw protobuf?
asked Feb 15 at 14:07 by Sim

1 Answer
This is because a serialized Protobuf message can be ambiguous on its own, and Protobuf relies on a schema (a .proto definition) to disambiguate. Corollary: multiple Protobuf schemas may produce the same serialized message. For example, given this schema:
message X {
  string s = 1;
}
Using your preferred Protobuf SDK, the following message:
X{
  S: "android.microphone",
}
Marshals to (hex-encoded):
0a12616e64726f69642e6d6963726f70686f6e65
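To see where those bytes come from, here is a minimal sketch that encodes a single length-delimited field by hand, without the Protobuf library. The class and method names (`EncodeSketch`, `encodeString`, `writeVarint`, `toHex`) are made up for illustration. It assumes only the documented wire format: a varint tag of `(fieldNumber << 3) | wireType`, a varint length, then the UTF-8 payload.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class EncodeSketch {
    // Encodes one length-delimited field (wire type 2): tag, length, payload.
    static byte[] encodeString(int fieldNumber, String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, ((long) fieldNumber << 3) | 2); // field 1 -> 0x0a, field 2 -> 0x12
        writeVarint(out, payload.length);                // 18 -> 0x12 for "android.microphone"
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    // Base-128 varint: 7 bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeString(1, "android.microphone")));
    }
}
```

Nothing in the output records that the payload is a string. The same tag and wire type are used for strings, bytes, and nested messages.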
And using protoc to decode the message without a schema:
printf "0a12616e64726f69642e6d6963726f70686f6e65" \
  | xxd -r -p \
  | protoc --decode_raw
1 {
  12: 0x6d2e64696f72646e
  13: 0x656e6f68706f7263
}
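These values match your fixed64/double values. Without a schema, the decoder has to guess whether a length-delimited field holds a string or a nested message, and the 18 bytes of "android.microphone" happen to form a well-formed message: 'a' (0x61) reads as the tag for field 12 with wire type 1 (fixed64), consuming the 8 bytes "ndroid.m"; 'i' (0x69) reads as the tag for field 13 with wire type 1, consuming the 8 bytes "crophone". "stuff" does not parse as a message, so it is shown as a string.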
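The guess a schema-less decoder makes can be reproduced with a small hand-written parser. This is a simplified sketch, not how protoc or protod are actually implemented: the names (`AmbiguitySketch`, `tryParseAsMessage`) are hypothetical, and for brevity it treats the deprecated group wire types (3 and 4) as unparseable.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class AmbiguitySketch {
    private static final class Cursor { int pos; }

    // Reads a base-128 varint; throws if the buffer ends early or the varint is too long.
    static long readVarint(byte[] buf, Cursor c) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (c.pos >= buf.length) throw new IllegalArgumentException("truncated varint");
            byte b = buf[c.pos++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new IllegalArgumentException("varint too long");
    }

    // Reads n bytes as a little-endian integer and advances the cursor.
    static long readFixed(byte[] buf, Cursor c, int n) {
        if (buf.length - c.pos < n) throw new IllegalArgumentException("truncated fixed value");
        long v = 0;
        for (int i = 0; i < n; i++) {
            v |= (long) (buf[c.pos + i] & 0xFF) << (8 * i);
        }
        c.pos += n;
        return v;
    }

    // Returns the fields if the bytes form a well-formed message, or null if they do not.
    static List<String> tryParseAsMessage(byte[] buf) {
        List<String> fields = new ArrayList<>();
        Cursor c = new Cursor();
        try {
            while (c.pos < buf.length) {
                long tag = readVarint(buf, c);
                long fieldNumber = tag >>> 3;
                int wireType = (int) (tag & 7);
                if (fieldNumber == 0) return null; // field number 0 is not valid
                switch (wireType) {
                    case 0: // varint
                        fields.add(fieldNumber + ": " + readVarint(buf, c));
                        break;
                    case 1: // fixed64
                        fields.add(fieldNumber + ": 0x" + Long.toHexString(readFixed(buf, c, 8)));
                        break;
                    case 2: { // length-delimited: skip over the payload
                        long len = readVarint(buf, c);
                        if (len < 0 || len > buf.length - c.pos) return null;
                        c.pos += (int) len;
                        fields.add(fieldNumber + ": <" + len + " bytes>");
                        break;
                    }
                    case 5: // fixed32
                        fields.add(fieldNumber + ": 0x" + Long.toHexString(readFixed(buf, c, 4)));
                        break;
                    default: // groups (3, 4) and unused wire types: not handled in this sketch
                        return null;
                }
            }
        } catch (IllegalArgumentException e) {
            return null;
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(tryParseAsMessage("android.microphone".getBytes(StandardCharsets.UTF_8)));
        System.out.println(tryParseAsMessage("stuff".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Under these assumptions, "android.microphone" yields fields 12 and 13 with the same fixed64 values as in the question, while "stuff" is rejected because its first byte (0x73) carries wire type 3.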
Using protoc with the schema decodes the string correctly:
protoc --decode=X x.proto
s: "android.microphone"
You can corroborate this with Protobuf Decoder too using the hex-encoded output above.
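This is unavoidable without a schema. Your encoded bytes are correct; a schema-less decoder can only guess whether a length-delimited field is a string, raw bytes, or a nested message.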