Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add activities to Blazor Server for distributed tracing #29846

Open
Tracked by #5461
rynowak opened this issue Feb 2, 2021 · 36 comments
Open
Tracked by #5461

Add activities to Blazor Server for distributed tracing #29846

rynowak opened this issue Feb 2, 2021 · 36 comments
Labels
affected-few This issue impacts only small number of customers area-blazor Includes: Blazor, Razor Components enhancement This issue represents an ask for new feature or an enhancement to an existing one feature-blazor-server feature-observability Pillar: Dev Experience Priority:1 Work that is critical for the release, but we could probably ship without severity-major This label is used by an internal tool triaged

Comments

@rynowak
Copy link
Member

rynowak commented Feb 2, 2021

Describe the bug

The long-lived circuits of Blazor server make distributed tracing not work as expected.

Since each circuit is effectively a long-lived request ... a lot of activity (pun intended) ends up getting traced under the same activity. Basically all outgoing requests/traces from inside of a Blazor server circuit (browser tab) will have the same parent span, which causes them to all get grouped together as a single "root cause" in tracing UIs.

To Reproduce

Create a Blazor server application. Use a button to fire off HTTP requests to another ASP.NET Core endpoint. Hook all of this up to your favorite distributed tracing system and enjoy.

All of the operations you start from a Blazor circuit will end inside the same single logical operation from a tracing POV.

image

@rynowak
Copy link
Member Author

rynowak commented Feb 2, 2021

/cc @davidfowl @noahfalk

@javiercn javiercn added the area-blazor Includes: Blazor, Razor Components label Feb 2, 2021
@javiercn
Copy link
Member

javiercn commented Feb 2, 2021

@rynowak thanks for reporting for duty.

I imagine this is not only true for Blazor but for Websockets/SignalR in general.

Is there something we can do here to "disconnect" the distributed tracing?

Is it really correct if we somehow disconnect the Blazor Server session from the original request that started it?

@javiercn javiercn added this to the Next sprint planning milestone Feb 2, 2021
@ghost
Copy link

ghost commented Feb 2, 2021

Thanks for contacting us.
We're moving this issue to the Next sprint planning milestone for future evaluation / consideration. We will evaluate the request when we are planning the work for the next milestone. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

@javiercn javiercn added the enhancement This issue represents an ask for new feature or an enhancement to an existing one label Feb 2, 2021
@rynowak
Copy link
Member Author

rynowak commented Feb 2, 2021

Is there something we can do here to "disconnect" the distributed tracing?

I'm currently working around this by setting Activity.Current = null, and it seems to do the trick. I would imagine that a framework level fix would create a new root activity for each operation that comes through the circuit handler. You can use tags/baggage to attach things like a circuit ID so that correlation is possible.

So it is can be controlled ....

Is it really correct if we somehow disconnect the Blazor Server session from the original request that started it?

That's a great question 😁

I think the major downside of the current behavior is that a really really long-lived circuit could include thousands of descendant spans. Think about @danroth27 giving the counter demo for 10 hours as part of a charity livestream. In the worst case this will break downstream tooling, or in the best case just give you data that's hard to work with. I'm really not sure.

I might also be the first person to notice this, so I figured it was worth a discussion 🤷

@javiercn
Copy link
Member

javiercn commented Feb 2, 2021

@rynowak you bring up good points.

I'm not super familiar with the activity id APIs, but In general I believe the fix here is to "detach it" from the original request activity and then to start a new one per received message on the circuit as you suggest, and then tag it with the circuit info

@noahfalk
Copy link
Member

noahfalk commented Feb 3, 2021

cc @tarekgh

Thanks for raising this @rynowak. I'll try to take a little time in the next day or two to repro this and understand where the Activities are currently being created so I have more context. I may also find alternative suggestions on how the issue can be resolved. I imagine as .NET developers use distributed tracing more frequently it will behoove us to have good guidance and understanding of common patterns that show up.

You can use tags/baggage to attach things like a circuit ID so that correlation is possible

Links are a new concept added in the 5.0 release of DiagnosticSource.dll that may also be useful here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md#links-between-spans

@tarekgh
Copy link
Member

tarekgh commented Feb 3, 2021

Can we get a debugger dump which we can see where the Activity objects get created. That may be faster I guess if possible. I am guessing there is some activities started and never stopped.

@davidfowl
Copy link
Member

It's an issue we need to address in signalr and then we can have a broader discussion about what behavior we want in general.

@noahfalk
Copy link
Member

noahfalk commented Feb 4, 2021

I didn't have any background on how the Blazor app front-end and back-end communicate, but given @davidfowl's comment + a little debugging it appears to be SignalR protocol over WebSocket, is that accurate?

In repro app there was a short series of paired Activity start/stops but then 8th Activity start is when Kestrel receives the HTTP request that initiates the WebSocket and that one is long lived by design. Additional activity start/stop pairs get recorded re-entrantly as the server sends messages to itself when the button is clicked. In terms of what to do about it...

  1. We probably want to define how a W3C trace id should be embedded in a SignalR request and then implement SignalR client and server code to transmit/receive those ids. This would allow us to have traces that start at the front-end and track fine-grained interactions such as clicking a button, similar to what is already possible using more traditional HTTP request/response patterns.
  2. We need to figure out what happens if multiple layers of a protocol stack are sending different trace ids. I haven't seen prior art on this but maybe I haven't looked in the right places yet. I think we have a few options:
  • Avoid the issue by not sending a trace id on the HTTP GET that initiates the WebSocket
  • On the server when the nested SignalR protocol provides a new trace-id we could clear the HTTP level Activity and start the new one. Ideally we'd also restore the HTTP Activity after responding. There is a bug doing that today but I assume it will get fixed or we could work around it by manually setting Activity.Current. There may also be some value in adding a link between the HTTP level activity and the SignalR one.

@danroth27
Copy link
Member

[Blazor Server] appears to be SignalR protocol over WebSocket, is that accurate?

Yup

@davidfowl
Copy link
Member

On the server when the nested SignalR protocol provides a new trace-id we could clear the HTTP level Activity and start the new one. Ideally we'd also restore the HTTP Activity after responding. There is a bug doing that today but I assume it will get fixed or we could work around it by manually setting Activity.Current. There may also be some value in adding a link between the HTTP level activity and the SignalR one.

Related to #18711

cc @BrennanConroy

@davidfowl
Copy link
Member

I'm gonna take a stab at this. I think it's more than the activity though that's related here. I don't think we should be propagating the execution context from the request to the hub execution. It's especially broken for long polling scenarios where there are multiple requests throughout the execution of the hub.

@ericsampson
Copy link

Ryan wrote:

I might also be the first person to notice this, so I figured it was worth a discussion 🤷

I do remember someone else mentioning this in the context of SignalR a while back, I'm not having any luck finding the issue again unfortunately.

@SteveSandersonMS SteveSandersonMS added the area-signalr Includes: SignalR clients and servers label Feb 19, 2021
@SteveSandersonMS SteveSandersonMS changed the title Blazor server takes distributed tracing to strange new places Blazor Server / SignalR / WebSockets takes distributed tracing to strange new places Feb 19, 2021
@SteveSandersonMS
Copy link
Member

SteveSandersonMS commented Feb 19, 2021

I'm not sure how useful it's going to be to have a separate activity ID for every message flowing across the websocket connection. The traces will then just say things like "Within circuit abcdef, following receipt of event 4728, we made HTTP requests to the following URLs...". Does the "event 4728" information help the person reading the trace?

So far it hasn't been a design goal to capture more information about each user activity within a Blazor Server circuit. If this was an area of growing customer demand, we could start pulling more information out of the rendertree and navigation manager so (for example) we could say things like "Within circuit abcdef, while on URL /counter, in reaction to a click event on a button element with class btn counter rendered by CounterComponent, we made HTTP requests to the following URLs...". But I don't personally know if or how this sort of detail can be communicated usefully to an upstream tracing technology. Also we'd have to use something like async locals to correlate actions to their causal events.

Another argument would be that tracing technologies could offer more value by moving beyond thinking in terms of short HTTP round trips and have first-class support for long-running connections (WebSockets in general, not just Blazor Server). I know that would only address the UX concern about how info is presented in the tracing UI, so it's still applicable to think about how Blazor Server can provide more human-readable context information as per the previous paragraph.

@SteveSandersonMS SteveSandersonMS added affected-few This issue impacts only small number of customers severity-major This label is used by an internal tool labels Feb 19, 2021 — with ASP.NET Core Issue Ranking
@noahfalk
Copy link
Member

So far it hasn't been a design goal to capture more information about each user activity within a Blazor Server circuit

I think its fine to decide that this isn't a priority for you right now if you aren't seeing enough signal that customers care. My guess is that customers will care about this in proportion to how often apps using SignalR are sluggish and how easy it is to correlate actions on one machine to specific code that runs on another tier. For example if there is always a nice 1:1 correspondence between pushing a button in a Blazor front-end and OnButtonClicked() method running on a single back-end + all the expensive work happens inside OnButtonClicked() then the value of instrumenting the SignalR protocol is fairly low. On the other hand if those SignalR communications are bouncing across relay servers and lots of different code on different machines is being invoked in response then there are numerous places where performance issues could creep in. It would be very hard to track it down without timestamps and causal links on the messages being sent and received at each tier.

If we don't want to do the work of sending identifiers on each message I do think there are lower level amounts of work that would still be quite helpful:

  1. Starting a trace at the point each request is processed on the back-end. This is often the level of support that is present for web clients and servers. The operator of the service never gets any information about what happens client side but they do see that a request to a specific endpoint was generated and they can track that part of the work. If SignalR is only used in client->server communications and not in server<->server communications this could be a pretty comfortable place to be.
  2. Suppress generating a trace-id that wraps the entire WebSocket connection. In theory a sophisticated tracing tool could break it all apart and extract useful value, but in practice improvements to long circuit scenarios are unlikely to a priority at the tooling end for a long time to come. Avoiding emitting a trace id on the web-socket will mean that tracing tools create unique IDs on the server for small chunks of work like database requests or further HTTP messages. This allows users to at least diagnose performance issues in the downstream components like databases and other micro-services even if they can't correlate it back to specific UI actions or incoming server requests.

But I don't personally know if or how this sort of detail can be communicated usefully to an upstream tracing technology

For a server<->server SignalR communication many of the supporting pieces are already in place and I could walk you through what remains. For client<->server we presumably also need one of the tracing uploaders like OpenTelemetry or ApplicationInsights to support running in the blazor client environment. I'm happy to get the right people in touch but I suspect its more a question or priorities at this point.

"Within circuit abcdef, following receipt of event 4728, we made HTTP requests to the following URLs...". Does the "event 4728" information help the person reading the trace?

It helps if there is instrumentation on the other end of the communication to tie it to. For server<->server usage there would probably be some of that instrumentation in place. For Blazor client<->server it would be there only if you added it (which isn't hard) and a telemetry uploader supported transmitting it (this part is likely a bit more work).

Hope that helps? I'm glad to chat online if that is easier, I know there is a lot here : )

@angularsen
Copy link

angularsen commented Mar 24, 2021

We're having the same problem.

Setup:

  1. Blazor server side app, .NET 5
  2. Public app-facing API, .NET Core.
  3. Several internal APIs, .NET Core.

Example:

  1. Blazor app calls public API several times on startup, like authentication and GET /config.
  2. Blazor app then calls GET /Appointments.
  3. User may then POST /Appointments or other user actions.

If we are interested in debugging POST /Appointments in Azure Application Insights end-to-end transaction, then we see ALL distributed tracing for ALL Blazor requests to the public APIs, including all internal API calls, SQL commands, Redis requests etc, for each and every Blazor http request since the user visited the web app.

This becomes extremely verbose and makes debugging even for short user sessions with Application Insights end-to-end transaction impractical. I would expect to be able to explicitly start activities in Blazor to logically group related Blazor requests and to opt-out of the Blazor circuit's root activity.

A workaround we are experimenting with right now is setting Activity.Current = null in several key places, such as upon creating a new circuit and upon resolving specific early scoped dependencies (per circuit).

@ghost
Copy link

ghost commented Jul 20, 2021

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.

@mkArtakMSFT mkArtakMSFT added Docs This issue tracks updating documentation Pillar: Technical Debt and removed webui-P3-TP labels Nov 5, 2023
@ghost
Copy link

ghost commented Dec 12, 2023

Thanks for contacting us.

We're moving this issue to the .NET 9 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s).
If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues.
To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

@mkArtakMSFT mkArtakMSFT added Pillar: Dev Experience Priority:1 Work that is critical for the release, but we could probably ship without and removed Pillar: Technical Debt labels Jan 11, 2024
@almostjulian
Copy link

+1 on guidance for working around this. Where to put Activity.Current = null exactly?

@danroth27 danroth27 removed the Docs This issue tracks updating documentation label Jan 25, 2024
@danroth27
Copy link
Member

SignalR is planning to add activities for SignalR invocations: #51557. We should understand the impact of this work on Blazor Server scenarios and what additional activities we need to add for Blazor Server specifically.

@danroth27 danroth27 changed the title Blazor Server / SignalR / WebSockets takes distributed tracing to strange new places Add activities to Blazor Server for distributed tracing Jan 25, 2024
@davidfowl
Copy link
Member

The situation here has improved in .NET 9. All Blazor server interactions will no longer be rooted under a single request span and instead, will render as individual spans in distributed tracing tools.

Blazor has no built-in activities, and we'll need to explore which of the many suggestions on this thread to implement. In .NET 9, you can create an activity to represent the logical operation being performed (for e.g. Buy Item) and that will show up something like this:

e.g.:

image

An HTTP request nested underneath the "IncrementCount" span.

image

@rynowak
Copy link
Member Author

rynowak commented Sep 17, 2024

That's awesome progress. I'm still a little disappointed to see the issue title change 😢. Some of my best work IMO.

@ericsampson
Copy link

I subscribed to this issue just to see the title pop up occasionally in my notifications and make me smile lol 🤣

@mikekistler mikekistler removed the area-signalr Includes: SignalR clients and servers label Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affected-few This issue impacts only small number of customers area-blazor Includes: Blazor, Razor Components enhancement This issue represents an ask for new feature or an enhancement to an existing one feature-blazor-server feature-observability Pillar: Dev Experience Priority:1 Work that is critical for the release, but we could probably ship without severity-major This label is used by an internal tool triaged
Projects
None yet
Development

No branches or pull requests