-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proto: revert UTF-8 validation for proto2 #628
Conversation
3c62020
to
582a39e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we export errInvalidUTF8?
protoc-gen-go/generator/generator.go
Outdated
@@ -1573,7 +1573,9 @@ func (g *Generator) goTag(message *Descriptor, field *descriptor.FieldDescriptor | |||
if *field.Type == descriptor.FieldDescriptorProto_TYPE_BYTES { | |||
name += ",proto3" | |||
} | |||
|
|||
if *field.Type == descriptor.FieldDescriptorProto_TYPE_STRING { | |||
name += ",utf8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use the "proto3" tag instead? We may discover other cases in the future where proto2/proto3 fields need different handling, and it'll be more efficient to have a single tag in that case.
If so, should we preemptively tag all proto3 fields, even if they currently have no semantic difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proto3
tag was already added to distinguish the semantic difference for []byte{}
vs []byte(nil)
when it comes to the bytes type.
Looking through the code history, the attribute was only generated on the bytes type out of fear for binary size increase. If we want to go with proto3
, should we just generate it for all fields? That's an increase of 7 bytes for every field. Alternatively, we can only generate it for the string
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're worried about binary size increase that argues for a proto3 tag rather than utf8, to avoid the possibility of redundant tags in the future.
I don't really have a strong opinion on every field vs. string, although there are a couple arguments for the former: It avoids any possibility of needing to do this again in the future, and I think all the cases where 7 bytes per field might conceivably be an issue are proto2 anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I enabled it unilaterally for all fields.
Let's avoid exporting the error for now. Several thoughts:
|
The proto specification officially says that proto2 and proto3 strings should be validated, but pragmatically, compliance with the spec has been poor. For example, the Go implementation did not validate either and added strict validation recently to be compliant. However, this caused signficant breakage. Cases of breakage should change the proto field type from string to the bytes type. However, this is not always possible, when the field is part of the exposed API. This tends to be the case for proto2, where some other notable language implementations (like C++) do not validate proto2 for valid UTF-8. However, since most language implementations do validate for UTF-8 in proto3, we keep that behavior. Making this change for Go is a little tricky since each field does not necessarily know whether it is operating under the proto2 or proto3 syntax. Thus, we modify the generator to emit a "proto3" struct field tag for all fields in proto3. The implications of this change is that people will need to regenerate their proto files to have UTF-8 validation. We expand UTF-8 validation tests to ensure this works for the cross-product of (proto2, proto3) and (scalar, vector, oneof, and maps) fields with strings. Fixes #622
582a39e
to
28c8b1f
Compare
This update uses the latest version of the protoc Go plugin, which reverted a change in UTF-8 validation for proto2 (and added fields for proto3). This change removed validation of UTF-8 in proto2 and added a new field in proto3 that signals validation. Relevant PR: golang/protobuf#628
This update uses the latest version of the protoc Go plugin, which reverted a change in UTF-8 validation for proto2 (and added fields for proto3). This change removed validation of UTF-8 in proto2 and added a new field in proto3 that signals validation. Relevant PR: golang/protobuf#628
This update uses the latest version of the protoc Go plugin, which reverted a change in UTF-8 validation for proto2 (and added fields for proto3). This change removed validation of UTF-8 in proto2 and added a new field in proto3 that signals validation. Relevant PR: golang/protobuf#628
The proto specification officially says that proto2 and proto3 strings should be
validated, but pragmatically, compliance with the spec has been poor.
For example, the Go implementation did not validate either and added strict
validation recently to be compliant. However, this caused signficant breakage.
Cases of breakage should change the proto field type from string to the bytes type.
However, this is not always possible, when the field is part of the exposed API.
This tends to be the case for proto2, where some other notable language
implementations (like C++) do not validate proto2 for valid UTF-8.
However, since most language implementations do validate for UTF-8 in proto3,
we keep that behavior.
Making this change for Go is a little tricky since each field does not necessarily
know whether it is operating under the proto2 or proto3 syntax. Thus, we modify
the generator to emit a "proto3" struct field tag for all fields in proto3.
The implications of this change is that people will need to regenerate their
proto files to have UTF-8 validation.
We expand UTF-8 validation tests to ensure this works for the cross-product of
(proto2, proto3) and (scalar, vector, oneof, and maps) fields with strings.
Fixes #622