proto: revert UTF-8 validation for proto2 #628

dsnet · 2018-06-05T23:42:12Z

The proto specification officially says that proto2 and proto3 strings should be
validated, but pragmatically, compliance with the spec has been poor.
For example, the Go implementation did not validate either and only recently
added strict validation to be compliant. However, this caused significant breakage.

The proper fix for such breakage is to change the proto field type from string to bytes.
However, this is not always possible when the field is part of an exposed API.
This tends to be the case for proto2, where some other notable language
implementations (like C++) do not validate proto2 strings for valid UTF-8.
However, since most language implementations do validate UTF-8 in proto3,
we keep that behavior for proto3.

Making this change for Go is a little tricky since each field does not necessarily
know whether it is operating under the proto2 or proto3 syntax. Thus, we modify
the generator to emit a "proto3" struct field tag for all fields in proto3.
The implication of this change is that people will need to regenerate their
proto files to have UTF-8 validation.

We expand UTF-8 validation tests to ensure this works for the cross-product of
(proto2, proto3) and (scalar, vector, oneof, and maps) fields with strings.

Fixes #622

neild

Should we export errInvalidUTF8?

neild · 2018-06-06T15:33:53Z

protoc-gen-go/generator/generator.go

@@ -1573,7 +1573,9 @@ func (g *Generator) goTag(message *Descriptor, field *descriptor.FieldDescriptor
 		if *field.Type == descriptor.FieldDescriptorProto_TYPE_BYTES {
 			name += ",proto3"
 		}
-
+		if *field.Type == descriptor.FieldDescriptorProto_TYPE_STRING {
+			name += ",utf8"


Should this use the "proto3" tag instead? We may discover other cases in the future where proto2/proto3 fields need different handling, and it'll be more efficient to have a single tag in that case.

If so, should we preemptively tag all proto3 fields, even if they currently have no semantic difference?

The proto3 tag was already added to distinguish the semantic difference for []byte{} vs []byte(nil) when it comes to the bytes type.

Looking through the code history, the attribute was only generated on the bytes type out of fear of increasing binary size. If we want to go with proto3, should we just generate it for all fields? That's an increase of 7 bytes for every field. Alternatively, we can generate it only for string fields.

If we're worried about binary size increase, that argues for a proto3 tag rather than utf8, to avoid the possibility of redundant tags in the future.

I don't really have a strong opinion on every field vs. string, although there are a couple arguments for the former: It avoids any possibility of needing to do this again in the future, and I think all the cases where 7 bytes per field might conceivably be an issue are proto2 anyway.

I enabled it unilaterally for all fields.
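
To make the size trade-off in this thread concrete, here is a minimal sketch of appending the tag for every proto3 field. `buildTag` and its parameters are hypothetical and only loosely modeled on the `goTag` logic in the diff above; the real generator handles many more cases.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildTag is a hypothetical, simplified stand-in for the generator's tag
// construction. It appends ",proto3" for every field in a proto3 file,
// regardless of the field's type.
func buildTag(wire string, number int, name string, proto3 bool) string {
	tag := wire + "," + strconv.Itoa(number) + ",opt,name=" + name
	if proto3 {
		tag += ",proto3"
	}
	return tag
}

func main() {
	p2 := buildTag("bytes", 1, "title", false)
	p3 := buildTag("bytes", 1, "title", true)
	fmt.Println(p2)
	fmt.Println(p3)
	// The per-field cost mentioned in the thread: len(",proto3") == 7.
	fmt.Println(len(p3) - len(p2)) // prints 7
}
```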

dsnet · 2018-06-06T17:53:20Z

Let's avoid exporting the error for now.

Several thoughts:

As it stands, the current logic fails fast when there is invalid UTF-8. For a distinguishable error to be useful, the message would still need to be fully marshaled or unmarshaled.
We probably wouldn't want a sentinel error. A useful piece of information is the exact field that had the UTF-8 error.
If we exported a new type, then we would have to think carefully about what happens when a message has both unset required fields and invalid UTF-8. You would probably want to report that both errors occurred. Here's a case where your error tagging idea would be really useful.

This update uses the latest version of the protoc Go plugin, which reverted a change in UTF-8 validation for proto2 (and added fields for proto3). This change removed validation of UTF-8 in proto2 and added a new field in proto3 that signals validation. Relevant PR: golang/protobuf#628

dsnet requested a review from neild June 5, 2018 23:42

dsnet force-pushed the invalid-utf8-error2 branch 2 times, most recently from 3c62020 to 582a39e Compare June 5, 2018 23:50

neild reviewed Jun 6, 2018

dsnet force-pushed the invalid-utf8-error2 branch from 582a39e to 28c8b1f Compare June 6, 2018 19:54

neild approved these changes Jun 6, 2018

dsnet merged commit 05f48f4 into master Jun 6, 2018

dsnet deleted the invalid-utf8-error2 branch June 6, 2018 20:26

This was referenced Jun 6, 2018

datastore: []byte() cannot be saved or loaded if it contains invalid UTF-8 bytes golang/appengine#143

Closed

datastore: invalid datastore.Property_Meaning: ENTITY_PROTO golang/appengine#140

Closed

Fixdatastore golang/appengine#141

Closed

menghanl mentioned this pull request Jun 8, 2018

Revert "status: handle invalid utf-8 characters" grpc/grpc-go#2127

Merged

sbuss mentioned this pull request Jul 26, 2018

Create new release #659

Closed

golang locked and limited conversation to collaborators Jun 26, 2020
