Add activities to Blazor Server for distributed tracing #29846

rynowak · 2021-02-02T05:46:26Z

Describe the bug

The long-lived circuits of Blazor server make distributed tracing not work as expected.

Since each circuit is effectively a long-lived request ... a lot of activity (pun intended) ends up getting traced under the same activity. Basically all outgoing requests/traces from inside of a Blazor server circuit (browser tab) will have the same parent span, which causes them to all get grouped together as a single "root cause" in tracing UIs.

To Reproduce

Create a Blazor server application. Use a button to fire off HTTP requests to another ASP.NET Core endpoint. Hook all of this up to your favorite distributed tracing system and enjoy.

All of the operations you start from a Blazor circuit will end inside the same single logical operation from a tracing POV.

rynowak · 2021-02-02T05:46:38Z

/cc @davidfowl @noahfalk

javiercn · 2021-02-02T08:58:39Z

@rynowak thanks for reporting for duty.

I imagine this is not only true for Blazor but for Websockets/SignalR in general.

Is there something we can do here to "disconnect" the distributed tracing?

Is it really correct if we somehow disconnect the Blazor Server session from the original request that started it?

ghost · 2021-02-02T08:58:49Z

Thanks for contacting us.
We're moving this issue to the Next sprint planning milestone for future evaluation / consideration. We will evaluate the request when we are planning the work for the next milestone. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

rynowak · 2021-02-02T16:44:22Z

Is there something we can do here to "disconnect" the distributed tracing?

I'm currently working around this by setting Activity.Current = null, and it seems to do the trick. I would imagine that a framework level fix would create a new root activity for each operation that comes through the circuit handler. You can use tags/baggage to attach things like a circuit ID so that correlation is possible.

So it is can be controlled ....

Is it really correct if we somehow disconnect the Blazor Server session from the original request that started it?

That's a great question 😁

I think the major downside of the current behavior is that a really really long-lived circuit could include thousands of descendant spans. Think about @danroth27 giving the counter demo for 10 hours as part of a charity livestream. In the worst case this will break downstream tooling, or in the best case just give you data that's hard to work with. I'm really not sure.

I might also be the first person to notice this, so I figured it was worth a discussion 🤷

javiercn · 2021-02-02T16:53:55Z

@rynowak you bring up good points.

I'm not super familiar with the activity id APIs, but In general I believe the fix here is to "detach it" from the original request activity and then to start a new one per received message on the circuit as you suggest, and then tag it with the circuit info

noahfalk · 2021-02-03T01:00:19Z

cc @tarekgh

Thanks for raising this @rynowak. I'll try to take a little time in the next day or two to repro this and understand where the Activities are currently being created so I have more context. I may also find alternative suggestions on how the issue can be resolved. I imagine as .NET developers use distributed tracing more frequently it will behoove us to have good guidance and understanding of common patterns that show up.

You can use tags/baggage to attach things like a circuit ID so that correlation is possible

Links are a new concept added in the 5.0 release of DiagnosticSource.dll that may also be useful here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md#links-between-spans

tarekgh · 2021-02-03T01:11:36Z

Can we get a debugger dump which we can see where the Activity objects get created. That may be faster I guess if possible. I am guessing there is some activities started and never stopped.

davidfowl · 2021-02-03T02:30:23Z

It's an issue we need to address in signalr and then we can have a broader discussion about what behavior we want in general.

noahfalk · 2021-02-04T09:43:45Z

I didn't have any background on how the Blazor app front-end and back-end communicate, but given @davidfowl's comment + a little debugging it appears to be SignalR protocol over WebSocket, is that accurate?

In repro app there was a short series of paired Activity start/stops but then 8th Activity start is when Kestrel receives the HTTP request that initiates the WebSocket and that one is long lived by design. Additional activity start/stop pairs get recorded re-entrantly as the server sends messages to itself when the button is clicked. In terms of what to do about it...

We probably want to define how a W3C trace id should be embedded in a SignalR request and then implement SignalR client and server code to transmit/receive those ids. This would allow us to have traces that start at the front-end and track fine-grained interactions such as clicking a button, similar to what is already possible using more traditional HTTP request/response patterns.
We need to figure out what happens if multiple layers of a protocol stack are sending different trace ids. I haven't seen prior art on this but maybe I haven't looked in the right places yet. I think we have a few options:

Avoid the issue by not sending a trace id on the HTTP GET that initiates the WebSocket
On the server when the nested SignalR protocol provides a new trace-id we could clear the HTTP level Activity and start the new one. Ideally we'd also restore the HTTP Activity after responding. There is a bug doing that today but I assume it will get fixed or we could work around it by manually setting Activity.Current. There may also be some value in adding a link between the HTTP level activity and the SignalR one.

danroth27 · 2021-02-04T17:24:10Z

[Blazor Server] appears to be SignalR protocol over WebSocket, is that accurate?

Yup

davidfowl · 2021-02-05T01:42:26Z

On the server when the nested SignalR protocol provides a new trace-id we could clear the HTTP level Activity and start the new one. Ideally we'd also restore the HTTP Activity after responding. There is a bug doing that today but I assume it will get fixed or we could work around it by manually setting Activity.Current. There may also be some value in adding a link between the HTTP level activity and the SignalR one.

Related to #18711

cc @BrennanConroy

davidfowl · 2021-02-08T13:36:48Z

I'm gonna take a stab at this. I think it's more than the activity though that's related here. I don't think we should be propagating the execution context from the request to the hub execution. It's especially broken for long polling scenarios where there are multiple requests throughout the execution of the hub.

ericsampson · 2021-02-10T17:32:28Z

Ryan wrote:

I might also be the first person to notice this, so I figured it was worth a discussion 🤷

I do remember someone else mentioning this in the context of SignalR a while back, I'm not having any luck finding the issue again unfortunately.

SteveSandersonMS · 2021-02-19T13:08:54Z

I'm not sure how useful it's going to be to have a separate activity ID for every message flowing across the websocket connection. The traces will then just say things like "Within circuit abcdef, following receipt of event 4728, we made HTTP requests to the following URLs...". Does the "event 4728" information help the person reading the trace?

So far it hasn't been a design goal to capture more information about each user activity within a Blazor Server circuit. If this was an area of growing customer demand, we could start pulling more information out of the rendertree and navigation manager so (for example) we could say things like "Within circuit abcdef, while on URL /counter, in reaction to a click event on a button element with class btn counter rendered by CounterComponent, we made HTTP requests to the following URLs...". But I don't personally know if or how this sort of detail can be communicated usefully to an upstream tracing technology. Also we'd have to use something like async locals to correlate actions to their causal events.

Another argument would be that tracing technologies could offer more value by moving beyond thinking in terms of short HTTP round trips and have first-class support for long-running connections (WebSockets in general, not just Blazor Server). I know that would only address the UX concern about how info is presented in the tracing UI, so it's still applicable to think about how Blazor Server can provide more human-readable context information as per the previous paragraph.

noahfalk · 2021-02-20T09:03:59Z

So far it hasn't been a design goal to capture more information about each user activity within a Blazor Server circuit

I think its fine to decide that this isn't a priority for you right now if you aren't seeing enough signal that customers care. My guess is that customers will care about this in proportion to how often apps using SignalR are sluggish and how easy it is to correlate actions on one machine to specific code that runs on another tier. For example if there is always a nice 1:1 correspondence between pushing a button in a Blazor front-end and OnButtonClicked() method running on a single back-end + all the expensive work happens inside OnButtonClicked() then the value of instrumenting the SignalR protocol is fairly low. On the other hand if those SignalR communications are bouncing across relay servers and lots of different code on different machines is being invoked in response then there are numerous places where performance issues could creep in. It would be very hard to track it down without timestamps and causal links on the messages being sent and received at each tier.

If we don't want to do the work of sending identifiers on each message I do think there are lower level amounts of work that would still be quite helpful:

Starting a trace at the point each request is processed on the back-end. This is often the level of support that is present for web clients and servers. The operator of the service never gets any information about what happens client side but they do see that a request to a specific endpoint was generated and they can track that part of the work. If SignalR is only used in client->server communications and not in server<->server communications this could be a pretty comfortable place to be.
Suppress generating a trace-id that wraps the entire WebSocket connection. In theory a sophisticated tracing tool could break it all apart and extract useful value, but in practice improvements to long circuit scenarios are unlikely to a priority at the tooling end for a long time to come. Avoiding emitting a trace id on the web-socket will mean that tracing tools create unique IDs on the server for small chunks of work like database requests or further HTTP messages. This allows users to at least diagnose performance issues in the downstream components like databases and other micro-services even if they can't correlate it back to specific UI actions or incoming server requests.

But I don't personally know if or how this sort of detail can be communicated usefully to an upstream tracing technology

For a server<->server SignalR communication many of the supporting pieces are already in place and I could walk you through what remains. For client<->server we presumably also need one of the tracing uploaders like OpenTelemetry or ApplicationInsights to support running in the blazor client environment. I'm happy to get the right people in touch but I suspect its more a question or priorities at this point.

"Within circuit abcdef, following receipt of event 4728, we made HTTP requests to the following URLs...". Does the "event 4728" information help the person reading the trace?

It helps if there is instrumentation on the other end of the communication to tie it to. For server<->server usage there would probably be some of that instrumentation in place. For Blazor client<->server it would be there only if you added it (which isn't hard) and a telemetry uploader supported transmitting it (this part is likely a bit more work).

Hope that helps? I'm glad to chat online if that is easier, I know there is a lot here : )

angularsen · 2021-03-24T09:07:16Z

We're having the same problem.

Setup:

Blazor server side app, .NET 5
Public app-facing API, .NET Core.
Several internal APIs, .NET Core.

Example:

Blazor app calls public API several times on startup, like authentication and GET /config.
Blazor app then calls GET /Appointments.
User may then POST /Appointments or other user actions.

If we are interested in debugging POST /Appointments in Azure Application Insights end-to-end transaction, then we see ALL distributed tracing for ALL Blazor requests to the public APIs, including all internal API calls, SQL commands, Redis requests etc, for each and every Blazor http request since the user visited the web app.

This becomes extremely verbose and makes debugging even for short user sessions with Application Insights end-to-end transaction impractical. I would expect to be able to explicitly start activities in Blazor to logically group related Blazor requests and to opt-out of the Blazor circuit's root activity.

A workaround we are experimenting with right now is setting Activity.Current = null in several key places, such as upon creating a new circuit and upon resolving specific early scoped dependencies (per circuit).

ghost · 2021-07-20T17:26:58Z

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.

ghost · 2023-12-12T17:51:22Z

Thanks for contacting us.

We're moving this issue to the .NET 9 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s).
If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues.
To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

almostjulian · 2024-01-24T19:18:15Z

+1 on guidance for working around this. Where to put Activity.Current = null exactly?

danroth27 · 2024-01-25T17:34:42Z

SignalR is planning to add activities for SignalR invocations: #51557. We should understand the impact of this work on Blazor Server scenarios and what additional activities we need to add for Blazor Server specifically.

davidfowl · 2024-09-17T08:19:23Z

The situation here has improved in .NET 9. All Blazor server interactions will no longer be rooted under a single request span and instead, will render as individual spans in distributed tracing tools.

Blazor has no built-in activities, and we'll need to explore which of the many suggestions on this thread to implement. In .NET 9, you can create an activity to represent the logical operation being performed (for e.g. Buy Item) and that will show up something like this:

e.g.:

An HTTP request nested underneath the "IncrementCount" span.

rynowak · 2024-09-17T16:16:23Z

That's awesome progress. I'm still a little disappointed to see the issue title change 😢. Some of my best work IMO.

ericsampson · 2024-09-18T14:43:25Z

I subscribed to this issue just to see the title pop up occasionally in my notifications and make me smile lol 🤣

javiercn added the area-blazor Includes: Blazor, Razor Components label Feb 2, 2021

javiercn added this to the Next sprint planning milestone Feb 2, 2021

javiercn added the enhancement This issue represents an ask for new feature or an enhancement to an existing one label Feb 2, 2021

javiercn mentioned this issue Feb 2, 2021

[Blazor server] Blazor server improvements for 6.0 #27885

Closed

davidfowl self-assigned this Feb 8, 2021

davidfowl mentioned this issue Feb 8, 2021

Start connections on a fresh ExecutionContext #29995

Closed

SteveSandersonMS added the area-signalr Includes: SignalR clients and servers label Feb 19, 2021

SteveSandersonMS changed the title ~~Blazor server takes distributed tracing to strange new places~~ Blazor Server / SignalR / WebSockets takes distributed tracing to strange new places Feb 19, 2021

SteveSandersonMS added affected-few This issue impacts only small number of customers severity-major This label is used by an internal tool labels Feb 19, 2021 — with ASP.NET Core Issue Ranking

javiercn added the feature-blazor-server label Apr 20, 2021

mkArtakMSFT modified the milestones: Next sprint planning, Backlog Jul 20, 2021

mkArtakMSFT added Docs This issue tracks updating documentation Pillar: Technical Debt and removed webui-P3-TP labels Nov 5, 2023

mkArtakMSFT modified the milestones: Planning: WebUI, .NET 9 Planning Dec 12, 2023

mkArtakMSFT added Pillar: Dev Experience Priority:1 Work that is critical for the release, but we could probably ship without and removed Pillar: Technical Debt labels Jan 11, 2024

danroth27 removed the Docs This issue tracks updating documentation label Jan 25, 2024

danroth27 mentioned this issue Jan 25, 2024

Web analytics and diagnostics #5461

Closed

3 tasks

danroth27 changed the title ~~Blazor Server / SignalR / WebSockets takes distributed tracing to strange new places~~ Add activities to Blazor Server for distributed tracing Jan 25, 2024

mkArtakMSFT modified the milestones: .NET 9 Planning, 9.0-preview6 Jan 30, 2024

mkArtakMSFT mentioned this issue Jan 30, 2024

ASP.NET Core Roadmap for .NET 9 #51834

Open

andrewlock mentioned this issue Feb 13, 2024

Auto-instrumentation for Blazor DataDog/dd-trace-dotnet#5156

Open

guardrex mentioned this issue Feb 26, 2024

Blazor .NET 9 tracking issue dotnet/AspNetCore.Docs#31909

Closed

mkArtakMSFT modified the milestones: 9.0-preview6, 9.0-preview7 May 29, 2024

mkArtakMSFT modified the milestones: 9.0-preview7, .NET 10 Planning Jun 26, 2024

mkArtakMSFT added the triaged label Nov 1, 2024

mikekistler removed the area-signalr Includes: SignalR clients and servers label Nov 5, 2024

mkArtakMSFT added the feature-observability label Nov 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add activities to Blazor Server for distributed tracing #29846

Add activities to Blazor Server for distributed tracing #29846

rynowak commented Feb 2, 2021 •

edited

Loading

rynowak commented Feb 2, 2021

javiercn commented Feb 2, 2021

ghost commented Feb 2, 2021

rynowak commented Feb 2, 2021

javiercn commented Feb 2, 2021

noahfalk commented Feb 3, 2021

tarekgh commented Feb 3, 2021

davidfowl commented Feb 3, 2021

noahfalk commented Feb 4, 2021 •

edited

Loading

danroth27 commented Feb 4, 2021

davidfowl commented Feb 5, 2021

davidfowl commented Feb 8, 2021

ericsampson commented Feb 10, 2021

SteveSandersonMS commented Feb 19, 2021 •

edited

Loading

noahfalk commented Feb 20, 2021

angularsen commented Mar 24, 2021 •

edited

Loading

ghost commented Jul 20, 2021

ghost commented Dec 12, 2023

almostjulian commented Jan 24, 2024

danroth27 commented Jan 25, 2024

davidfowl commented Sep 17, 2024

rynowak commented Sep 17, 2024

ericsampson commented Sep 18, 2024

Add activities to Blazor Server for distributed tracing #29846

Add activities to Blazor Server for distributed tracing #29846

Comments

rynowak commented Feb 2, 2021 • edited Loading

Describe the bug

To Reproduce

rynowak commented Feb 2, 2021

javiercn commented Feb 2, 2021

ghost commented Feb 2, 2021

rynowak commented Feb 2, 2021

javiercn commented Feb 2, 2021

noahfalk commented Feb 3, 2021

tarekgh commented Feb 3, 2021

davidfowl commented Feb 3, 2021

noahfalk commented Feb 4, 2021 • edited Loading

danroth27 commented Feb 4, 2021

davidfowl commented Feb 5, 2021

davidfowl commented Feb 8, 2021

ericsampson commented Feb 10, 2021

SteveSandersonMS commented Feb 19, 2021 • edited Loading

noahfalk commented Feb 20, 2021

angularsen commented Mar 24, 2021 • edited Loading

ghost commented Jul 20, 2021

ghost commented Dec 12, 2023

almostjulian commented Jan 24, 2024

danroth27 commented Jan 25, 2024

davidfowl commented Sep 17, 2024

rynowak commented Sep 17, 2024

ericsampson commented Sep 18, 2024

rynowak commented Feb 2, 2021 •

edited

Loading

noahfalk commented Feb 4, 2021 •

edited

Loading

SteveSandersonMS commented Feb 19, 2021 •

edited

Loading

angularsen commented Mar 24, 2021 •

edited

Loading