-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanups to OpenTelemetry tracing. #9
Conversation
Thanks for the pull request!! To ensure quality review, Couchbase employs a code review system based on Gerrit to manage the workflow of changes in addition to tracking our contributor agreements. To get this change in and collaborate in code review, please register on Gerrit and accept our CLA. The easiest way to do this is to follow the link below, sign in with your GitHub account and then follow through the steps provided on that page to sign an 'Individual' agreement: http://review.couchbase.org/#/settings/new-agreement. Keep in mind that the emails we are seeing on the commits are: Note: Please contact us if you have any issues registering with Gerrit! If you have not signed our CLA within 7 days, the Pull Request will be automatically closed. ::SDKBOT/PR:no_cla |
Signed the CLA - assuming the PR will autosync to gerrit at some point but let me know if there's any other steps. |
*/ | ||
public static OpenTelemetryRequestTracer wrap(final Tracer tracer) { | ||
return new OpenTelemetryRequestTracer(tracer); | ||
public static OpenTelemetryRequestTracer wrap(final OpenTelemetry openTelemetry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was always under the impression (maybe this is legacy knowledge from OpenTracing) that there is only one global tracer for each app, so that we would just whatever tracer the user has configured. Does this mean that with Otel you'll have one tracer instance for each library instrumented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup OTel is a bit different, the "Tracer" from OpenTracing is called "TracerProvider" in OTel. Each library indeed has its own Tracer with information about that library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that makes sense then!
private OpenTelemetryRequestTracer(Tracer tracer) { | ||
this.tracer = tracer; | ||
private OpenTelemetryRequestTracer(OpenTelemetry openTelemetry) { | ||
this.tracer = openTelemetry.getTracer("com.couchbase.client"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a scheme we should follow on the naming (I saw an example around otel.contrib?). Also will this show up in the actual UIs of the servers? If so maybe we should do com.couchbase.client. like com.couchbase.client.java, com.couchbase.client.dotnet etc..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't think I've seen a guidelines aside from what's here
Putting java here definitely makes sense, updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added java
but now I'm wondering whether any of jvm
, java-core
, java-core-io
are more appropriate since this isn't related to the java-client
artifact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you set the default to com.couchbase.client.jvm? I'm going to add subsequent options so that each higher level sdk (i.e. java, kotlin, scala) can override that "jvm" bit so we have a more specific namespace if we can get it but a sane default if not.
span -> span | ||
.hasName("get") | ||
.hasParentSpanId(parent.getSpanContext().getSpanId()) | ||
.hasKind(SpanKind.INTERNAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we send SpanKind.CLIENT on all of the couchbase-related spans that are created in the SDK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(i.e. also on the dispatch_to_server)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're not really allowed to have nested CLIENT
spans because the definition in the spec says a single CLIENT
span corresponds to a single SERVER
span. In our instrumentation, we generally suppress transport spans, so for example in AWS SDK instrumentation, we generate a CLIENT
span for the SDK call, but no span for the Apache HTTP Client request itself. In this case, I think that would be similar to suppressing the dispatch_to_server
span. However, it's not entirely bad modeling to have dispatch_to_server
as an INTERNAL
span either, but we'd generally only expect the "API Method" to be a CLIENT
span - imagine it as the name on client and server should be very similar. Ideally some day Couchbase itself has OpenTelemetry server instrumentation too and would have a get
span with type SERVER
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more question: how do you deal with "multiple" requests? So for example we have "get from replicas" which will internally send N requests to the active and all replicas. Would the outer parent span that covers all the individual requests also be a CLIENT? or really only ever the individual requests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We haven't really settled on a great way to model this, this spec issue is related.
open-telemetry/opentelemetry-specification#1360
If one of the replicas fails, does the entire request fail? I suppose not since we're expecting one replica to succeed - in that case I'd consider this to be a single CLIENT span, how many requests are made in the background is still an implementation detail. Tracing backends generally assume CLIENT span failures are indicative of a problem so it's good for those to be 1:1 if possible.
It can still be important for debugging though, so one idea we've had while discussing that issue is possibly modeling as events, modeling as INTERNAL spans, or maybe allowing nested CLIENT spans but we haven't come to a great conclusion yet unfortunately.
@@ -78,9 +78,7 @@ public void addEvent(String name, Instant timestamp) { | |||
|
|||
@Override | |||
public void end() { | |||
try (Scope scope = span.makeCurrent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if the span is completed on a different thread than it was created, it doesn't need to made current again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not just to call .end
If in this method, if we had a call like grpcClient.executeRpc
, then yes it would need to be made current so the gRPC client picked it up as a parent. Making a span current is only to allow implicit propagation between two different libraries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok that makes sense, thanks
@anuraaga thank you very much for your PR! I've posted a couple comments because I have some follow up questions if you don't mind. |
Hey @anuraaga, It looks like the email that this commit was created with (aaag@amazo.jp) isn't linked to your Gerrit account. You can add it here: http://review.couchbase.org/settings/#EmailAddresses and then this PR should be pulled to Gerrit. Cheers, Brett |
bd562f5
to
4b5f194
Compare
Your changes (commit: 4b5f194) have been pushed to the Couchbase Review Site: ::SDKBOT/PR:created |
Your changes (commit: d47105f) have been pushed to the Couchbase Review Site: Note: As your pull request contains multiple commits, we have performed an automatic squash of these commits into a single change-set. If this is not the desired behaviour, please consider submitting a pull request per discreet feature. ::SDKBOT/PR:updated |
@brett19 Thanks for the advice! As this is under individual CLA I needed to change my email address now looks like it's good :) |
if (parent != null) { | ||
spanBuilder.setParent(Context.current().with(castSpan(parent))); | ||
} else { | ||
spanBuilder.setNoParent(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually one more question here too: I assume you are refering to the implicit propagation through active and thread locals? What about where we are working under async / reactive environments where the one in the thread local might not actually be the "real" parent?
Btw, not sure if you noticed, we actually have an explicit propagation mechanism. On every request the user makes, in the option block, you can provide the parent explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for async environments, we still expect the context to be propagated through callbacks, for example wiring up using something like
Executor contextExecutor = Context.current().wrap(Runnable::run);
asyncClient.execute(request).thenComposeAsync(response -> asyncClient.execute(request2), contextExecutor)
...;
Or registering the executor in a dagger graph, or as a reactive subscriber, or maybe use reactive hooks via some instrumentation like https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/reactor-3.1/library, or using the java agent which also does automatic context propagation in many cases.
The reality is very few libraries support explicitly passing in a context like this couchbase one does, so propagating the context into threadlocals across async bounds is basically a requirement in these environments, e.g. if you're using gRPC or really any other library you need to be propagating context. On the bright side, this means that a couchbase user doesn't have to worry about couchbase specifically, they can treat it as any other client in their app and expect it to use the implicit parent when there's no explicit one.
Of course, it's up to you whether this is something you want to support - on the flip side, you could just have a couchbase sdk policy saying "always pass in the parent, we don't use implicit propagation at all". I wouldn't recommend it though since it'd be a jarring experience for users compared to other libraries.
@anuraaga thanks again, I have a couple more follow ups but then we should be good to go. |
Your changes (commit: 84e0767) have been pushed to the Couchbase Review Site: Note: As your pull request contains multiple commits, we have performed an automatic squash of these commits into a single change-set. If this is not the desired behaviour, please consider submitting a pull request per discreet feature. ::SDKBOT/PR:updated |
@daschl Sorry for the delay, updated the tracer name to be |
@anuraaga yes thanks very much - I'll complete the rest of it over at gerrit. If you are okay I'll amend your commit with a jira version etc so it aligns with our commit messages. thanks much for your contribution again! |
@anuraaga can you close the ticket here? ( I can't unfortunately ) |
@daschl Do you mean closing this PR? |
Yes, since it's already in gerrit. |
Got it. I might recommend either 1) Gerrit sync closes PR too or 2) Stop Gerrit since Github PR UI has made huge progress since back in the day. Not my project though ;) |
@anuraaga couchbase exclusively uses gerrit for all its projects - we just use github as mirrors since people are more comfortable with it. |
@daschl Yup - I'm just suggesting it's probably time to challenge that given how things have changed these past few years. I personally won't be coming back to this repo since I believe in contributor-first (Gerrit + sync + PR close is contributor-second). Just food for thought, please digest it since I won't be commenting anymore. Will be happy to continue to consume the binaries though since we have committed through open-telemetry/opentelemetry-java-instrumentation#2524 (via GitHub -- pardon my last snark ;). Cheers! |
When selecting a query error to propagate as exception, prefer non-retriable errors. Change-Id: I3e27530e94e462dda27e62161d5c114861497a1a Reviewed-on: https://review.couchbase.org/c/couchbase-jvm-clients/+/216262 Tested-by: Build Bot <[email protected]> Reviewed-by: Graham Pople <[email protected]>
Hello, I'm an OpenTelemetry maintainer - thanks for using OpenTelemetry for tracing in this project :)
I've applied some cleanups to this project to match some of the guidelines we have in https://github.com/open-telemetry/opentelemetry-java-instrumentation
OpenTelemetry
instance to instrumentation instead ofTracer
. Tracer name should be determined by the library, not by the usersetNoParent
when couchbase didn't provide a parent span. Couchbase spans still need to parent to, e.g. a spring-boot server spanmakeCurrent
- it is to make a span active in a threadlocal for a block of code, but immediately calling close is effectively an expensive no-op, and it's not needed forSpan.end()
either. Since you propagate parentRequestSpan
within the SDK, you don't need to ever usemakeCurrent
I think.One big remaining issue seems to be the span kind - couchbase spans corresponding to a user operation should be
CLIENT
, notINTERNAL
. But I think that would involve a mapping fromoperationName
to kind which I'm not familiar enough to solve myself.