Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade google.golang.org/grpc to 1.35.0 #2771

Closed
yurishkuro opened this issue Feb 2, 2021 · 23 comments · Fixed by #3056
Closed

Upgrade google.golang.org/grpc to 1.35.0 #2771

yurishkuro opened this issue Feb 2, 2021 · 23 comments · Fixed by #3056
Labels
good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@yurishkuro
Copy link
Member

The dependabot PR #2765 ran into compatibility issue, needs manual upgrade

@yurishkuro yurishkuro added help wanted Features that maintainers are willing to accept but do not have cycles to implement good first issue Good for beginners labels Feb 2, 2021
@yurishkuro
Copy link
Member Author

We may also want to include github.com/golang/protobuf -> 1.4.3 which also failed in #2774 due to deprecation. It will require regenerating proto types, which probably requires upgrading https://github.com/jaegertracing/docker-protobuf

@albertteoh
Copy link
Contributor

albertteoh commented Feb 9, 2021

Had a look at this. I was able to fix the initial undefined: manual.GenerateAndRegisterManualResolver error with the following diff to cmd/agent/app/reporter/grpc/builder.go:

-                       r, _ := manual.GenerateAndRegisterManualResolver()
+                       // Generate a random scheme
+                       scheme := strconv.FormatInt(time.Now().UnixNano(), 36)
+                       r := manual.NewBuilderWithScheme(scheme)
+                       dialOptions = append(dialOptions, grpc.WithResolvers(r))

After that, the following test fails upon upgrading google.golang.org/grpc to 1.35.0:

--- FAIL: TestTraceSpanIDMarshalProto (0.00s)
    --- FAIL: TestTraceSpanIDMarshalProto/protobuf (0.00s)
panic: invalid Go type model.TraceID for field jaeger.api_v2.SpanRef.trace_id [recovered]
	panic: invalid Go type model.TraceID for field jaeger.api_v2.SpanRef.trace_id

I think the panic is due to lack of support for marshalling structs like model.TraceID to bytes. That is, the type switch determines our target type is bytes, but the source type of struct is not supported.

Tried adding || t.Kind() == reflect.Struct || t.Kind() == reflect.Uint64 (TraceID and SpanID respectively) to get past this point.
I finally reached a point where we require a coder that knows how to un/marshal the source type (struct or uint64), but only string and []byte are supported. I can see a uint64 codec, so that should be solvable, but for struct, it's almost as if we need to recursively call Marshal until we reach primitive types.

Anyone have any ideas on how to solve this?

@jpkrohling
Copy link
Contributor

Anyone have any ideas on how to solve this?

@pavolloffay and I had similar problems in the past when trying to upgrade to 1.30, and I reckon Pavol did have a solution, but I can't remember exactly what it was. A change here would also imply a change to the protobuf image that we use to generate the IDL code. See:

#2443
jaegertracing/docker-protobuf#17

From what I remember, @bogdandrutu uses our protobuf as the base for OpenTelemetry's and recently rebased theirs. We could probably use that as reference.

@yurishkuro
Copy link
Member Author

We're able to use custom types because of gogo. Unless gogo stopped supporting grpc completely, it probably requires simultaneous upgrade, because it looks like grpc serialization goes into normal proto instead of gogo.

@joe-elliott
Copy link
Member

I was able to update the jaeger proto build image to use the latest gogo proto gen (v1.3.2), but I'm still running into this issue after regenerating all the proto.

I will continue to hack on this as I have time.

@joe-elliott
Copy link
Member

joe-elliott commented Feb 12, 2021

So I can get tests to pass, but that's about the end of the good news. There appears to be some incompatibilities between gogo proto, and github.com/golang/protobuf 1.4.x that are causing our problems. Here are a few threads that reference similar issues:

gogo/protobuf#686
gogo/protobuf#678

So I can get tests to pass by forcing GRPC to use gogoproto for encoding/decoding like this:

master...joe-elliott:upgrade-grpc

These tests passing/failing is independent of which version of gogoproto I use to generate the *.pb.go files. I've tried the current Jaeger image, my updated one and the otel one.

What I can't figure out is why this used to work or how it is supposed to work. I'm guessing it's related to these RegisterType methods: https://github.com/jaegertracing/jaeger/blob/master/model/model.pb.go#L709. Perhaps they are broken with github.com/golang/protobuf/proto 1.4.x?

Additionally, github.com/golang/protobuf/proto 1.4.x is actually now a facade that builds some complex reflection interfaces around the gogoproto type and eventually uses google.golang.org/protobuf/proto 1.25.x under the hood. Somewhere in all of this the gogoproto type is lost and it causes the customtype marshalling/unmarshalling to fail.

It also appears that gogoproto is looking for new ownership: gogo/protobuf#691. The correct path forward may be to drop any gogoproto custom functionality and use the standard tooling.

@yurishkuro
Copy link
Member Author

Wow, bad news about gogo. I tend to agree with your proposal that we need to move off of it.

The two biggest benefits we got from gogo are:

  1. Custom type mapping for IDs. In thrift the IDs were explicitly encoded as ints for efficiency, resulting in somewhat ugly traceIdLow/High fields. In proto the IDs are bytes, but byte slices require extra allocations. By using custom types we were able to keep efficient memory representation with ints and using bytes in proto. Does the official proto have any support for custom types?
  2. Non-nullable fields were also very helpful in reducing the amount of allocations, especially when using vectors of structs like []Tag. In the default proto that would always be generated as []*Tag, which is just stupid. Did the new proto improve in that area?

@jpkrohling
Copy link
Contributor

By using custom types we were able to keep efficient memory representation with ints and using bytes in proto. Does the official proto have any support for custom types?

Do you still have the benchmark comparing the two cases?

@yurishkuro
Copy link
Member Author

no, I don't recall committing it

@jpkrohling
Copy link
Contributor

Would be good to create a benchmark for it then, as the advantages might not be enough to offset the dependency.

@joe-elliott
Copy link
Member

joe-elliott commented Feb 17, 2021

It seems that moving to protoc-gen-go is going to be problematic in other ways. Regenerating proto and just trying to run tests in ./model is pretty grim.

$ go test
# github.com/jaegertracing/jaeger/model [github.com/jaegertracing/jaeger/model.test]
./keyvalue.go:159:11: kv.Compare undefined (type *KeyValue has no field or method Compare)
./keyvalue.go:191:13: kvs[i].Equal undefined (type KeyValue has no field or method Equal)
./process.go:29:44: cannot use typedTags (type KeyValues) as type []*KeyValue in field value
./process.go:37:18: cannot convert p.Tags (type []*KeyValue) to type KeyValues
./process.go:45:18: cannot convert p.Tags (type []*KeyValue) to type KeyValues
./sort.go:51:23: s[i].Spans[0].TraceID undefined (type *Span has no field or method TraceID, but does have TraceId)
./sort.go:51:51: s[j].Spans[0].TraceID undefined (type *Span has no field or method TraceID, but does have TraceId)
./sort.go:67:56: s[i].SpanID undefined (type *Span has no field or method SpanID, but does have SpanId)
./sort.go:67:70: s[j].SpanID undefined (type *Span has no field or method SpanID, but does have SpanId)
./sort.go:80:15: cannot use span.Tags (type []*KeyValue) as type []KeyValue in argument to sortTags
./sort.go:80:15: too many errors

Issues

  • protoc-gen-go does not generate methods like Compare() or Equal(). The standard proto package does have an equal method which will hopefully act as a drop in replacement. We might have to write custom functionality for any Compares.
  • protoc-gen-go does generate pointers to other messages and I can't find a way to prevent this. It does not generate pointers to base types like string, int32, etc.

Here's a diff showing the proto generated using the latest tooling:
master...joe-elliott:upgrade-grpc2

I also had to make some small changes in the ./idl files not shown here.

@yurishkuro
Copy link
Member Author

protoc-gen-go does generate pointers to other messages and I can't find a way to prevent this

And it doesn't look like this will improve anytime soon: golang/protobuf#1225

@joe-elliott
Copy link
Member

joe-elliott commented Feb 25, 2021

So I've gone down the path of attempting to modify the codebase to handle the new proto and it is turning into an enormous undertaking. There are field name changes, value to pointer types, lack of Equal/Compare methods, loss of custom gogo types (timestamp, descriptor, duration), and probably a bunch of things I haven't found yet.

I think the only paths forward are:

  • Be patient and see if gogoproto is brought up to date with the latest GRPC packages
  • Write some kind of shim in between the new generated proto and the rest of the codebase to allow it to treat the model similarly to how it did before. Obviously this is likely to come with a performance hit.
  • Force the codec like I did here: master...joe-elliott:upgrade-grpc and hope other subtle bugs don't crop up.

@yurishkuro
Copy link
Member Author

Another alternative I was thinking of - doesn't OTEL Collector also solve this problem somehow? Afaik their internal domain model avoids pointers too.

@joe-elliott
Copy link
Member

They do use gogofaster:

https://github.com/open-telemetry/opentelemetry-collector/blob/e6319ac4c6fce274b8ba2d0e3b8e5eb0fd0d6142/Makefile#L313

but they don't use the "non-nullable" types:

https://github.com/open-telemetry/opentelemetry-collector/blob/e6319ac4c6fce274b8ba2d0e3b8e5eb0fd0d6142/internal/data/protogen/trace/v1/trace.pb.go#L382

Whether or not the code generated by gogoprotobuf works with grpc 1.35 seems dependent on the features that were used.

@yurishkuro
Copy link
Member Author

What does "forcing the codec" do? If it solves the problem that it looks fine to me.

@bogdandrutu
Copy link

They do use gogofaster:

Yes

but they don't use the "non-nullable" types:

We do use non-nullable for bunch of places see https://github.com/open-telemetry/opentelemetry-collector/blob/e6319ac4c6fce274b8ba2d0e3b8e5eb0fd0d6142/proto_patch.sed

Alternative solution is to make a custom grpc marshaler if you have troubles, but you should not have problems because that interface did not change

@yurishkuro
Copy link
Member Author

@bogdandrutu what are your thoughts about this ticket? Is OTEL planning to stay with gogo, despite gogo being unsupported?

@bogdandrutu
Copy link

@yurishkuro one of the reason we actually don't expose gogo public and rather developed our own internal format pdata was that will allow us to change the proto compiler, and maybe one day to write our own "lazy" marshaler/unmarshaler that will parse and write directly to the data stream.

We will stick with gogo for at least the next 6+ months as our internal representation (behind the pdata). Also you have to remember that the interface between the proto object and gRPC is a simple one (since gRPC does not depend directly on proto) so we can always re-implement (and you can do the same) that one.

@yurishkuro
Copy link
Member Author

Thanks. We also do not expose gogo (well, almost, it does leak into grpc-plugin API), but right now it seems we're unable to upgrade to newer gRPC versions (unless I misunderstood Joe's comment about the custom codec).

@bogdandrutu
Copy link

bogdandrutu commented Feb 26, 2021

I remember that you are very behind with the generated protos https://github.com/jaegertracing/jaeger/blob/master/model/model.pb.go#L31 this looks like you are using a very old protoc and gogo library in your docker. We are on a newer version https://github.com/open-telemetry/opentelemetry-collector/blob/main/internal/data/protogen/trace/v1/trace.pb.go#L30 (still behind the official proto)

@joe-elliott
Copy link
Member

joe-elliott commented Feb 26, 2021

Latest GRPC/protobuf is seemingly incompatible with only some features of gogoproto. The one that we stumbled upon was custom marshalling/unmarshalling, but in this comment I link to a few others:

#2771 (comment)

I remember that you are very behind with the generated protos

We run into these issues whether we use the Jaeger proto builder or the latest otel builder. The below branch uses the latest otel protobuf builder and the original issues called out in this thread persist. I have not been able to dig into the exact mechanics, but grpc latest does not correctly "find" some of the gogo proto code and therefore fails on unmarshal/marshal.

However, I have been able to reduce the changes quite a bit. If we use this function encoding.RegisterCodec then tests pass and the impact to the codebase is minimal. Note that the encoding package is marked "experimental".

On this branch all tests pass: master...joe-elliott:upgrade-grpc

  • Uses otel-builder 0.2.1 for latest tooling
  • Copies in removed method generateAndRegisterManualResolver()
  • Uses experimental encoding.RegisterCodec to get grpc 1.35 to use gogo proto for marshalling/unmarshalling
  • Upgrades to protobuf 1.4.2 and grpc 1.35.0
  • No changes to idl

@siavashs
Copy link

siavashs commented May 2, 2021

I'm hitting the following error which could be related to this issue:

invalid Go type model.TraceID for field jaeger.api_v2.GetTraceRequest.trace_id

When trying to use the api_v2 package:

import (
	"github.com/jaegertracing/jaeger/model"
	api "github.com/jaegertracing/jaeger/proto-gen/api_v2"
        ...
)

type Reader struct {
	client api.QueryServiceClient
}

func (r *Reader) GetTrace(ctx context.Context, traceID model.TraceID) (*model.Trace, error) {
	req := api.GetTraceRequest{TraceID: traceID}
        stream, error := r.cient.GetTrace(ctx, &req) // triggers the type error
        ...
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants