Diagnostics Channel - Next steps #180

Closed · 8 tasks
mhdawson opened this issue Apr 2, 2018 · 24 comments

Labels: diag-deepdive-agenda (Used for agenda items related to diagnostic deep dive sessions), never-stale, Tracking-Issue

Comments

@mhdawson (Member) commented Apr 2, 2018

The concept of a diagnostic channel was discussed at the Diagnostics Summit as well as in #134.

In a nutshell, the concept is to provide an API that modules can use to publish events and an API that consumers (such as APM vendors) can use to subscribe to those events. The key differences between the channel provided by these APIs and trace events are:

  • The operation is synchronous: the consumer can take action before the publish call completes on the emitter side.
  • Data is passed by the publish call (along with any other data reachable from it), and this data can be mutated by the consumer.

This approach requires both an implementation of the diagnostic channel and updates to modules so that they use the API. As an interim step, monkey patching can be used to add calls to the API; this is seen as a good way to bootstrap.
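To make the monkey-patching bootstrap concrete, here is a minimal sketch. The publish API does not exist yet, so the `publish` function below is a stand-in, and the channel name is made up for illustration:

```js
// Minimal stand-in for the not-yet-defined publish API.
const subscribers = { 'http.server.request': [] };
function publish(name, message) {
  // Synchronous: subscribers run before the emitter continues and may mutate message.
  for (const fn of subscribers[name] || []) fn(message);
}

const http = require('http');

// The interim bootstrap: patch a core API so it publishes events even
// though the module itself has not been updated to call the channel API.
const originalEmit = http.Server.prototype.emit;
http.Server.prototype.emit = function (event, ...args) {
  if (event === 'request') {
    publish('http.server.request', { request: args[0], response: args[1] });
  }
  return originalEmit.call(this, event, ...args);
};
```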

Microsoft already has an experimental implementation, which we should evaluate to see what parts (if any) we can re-use.

An initial cut at next steps includes:

  • Create a new repo under the nodejs org where we can experiment with a new implementation - diagnostic-channel
  • Review the existing experimental implementation
  • Create a performance regression suite
  • Define the diagnostic publish API
  • Define the diagnostic consumer API
  • Pull what parts we can from the existing implementation into the publish API
  • Pull what parts we can from the existing implementation into the consumer API
  • Iterate until we hit a 0.1.0 release that is ready for broader review
@mcollina (Member) commented Apr 3, 2018

Overhead and performance should be a first-class concern for this API. It should result in no-ops when not enabled, and get smart when it is enabled. I think we should trade usability for performance whenever possible, as long as the API still satisfies our use cases.
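As a sketch of what "no-ops when not enabled" could look like at a publish site (the channel API here is a hypothetical stand-in, not a defined interface):

```js
// Minimal stand-in for a hypothetical channel API.
function getChannel(name) {
  const subscribers = [];
  return {
    name,
    subscribers,
    get hasSubscribers() { return subscribers.length > 0; },
    publish(data) { for (const fn of subscribers) fn(data); },
  };
}

const channel = getChannel('db.query'); // acquired once, reused

function runQuery(sql, params) {
  // When nothing is subscribed this is a single property check;
  // the message object is never even constructed.
  if (channel.hasSubscribers) {
    channel.publish({ sql, params });
  }
  // ... actual query work ...
}
```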

@jkrems (Contributor) commented Apr 3, 2018

@mcollina - [ ] Create performance regression suite?

@mcollina (Member) commented Apr 3, 2018

@jkrems 👍 !

@jkrems (Contributor) commented Apr 3, 2018

Added to the top post.

@Qard (Member) commented Apr 4, 2018

I brought it up in the diag meeting just now, but I recently worked a bit on a proof-of-concept universal diagnostics channel, after getting back from the Ottawa summit and then Elastic{ON}. It's a very rough demo and definitely needs more perf focus before it could actually be usable, but you can have a look at it here: https://github.com/Qard/universal-monitoring-bus

It uses subscription queries to express interest in certain messages on the channel, similar to trace_events but currently more flexible (sacrificing performance for flexibility while trying to identify interesting filter data). It does no direct graph-forming, instead opting to direct the message bus into aggregators, which can live in-process or out-of-process, and it makes no assumptions about the transport mechanism. It is simply a callback that receives JSON-serialization-ready objects that you can do whatever you like with.

It also expresses both spans and metric updates uniformly in a single message format, with purpose-specific extensions, allowing for an easy shared communication channel, aggregator re-use, and even correlation between span and metric data.
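For illustration only (this is not the actual format from universal-monitoring-bus), a unified span/metric message shape along those lines might look like:

```js
// Illustrative shapes, not the repo's actual format. One message shape
// carries both spans and metric updates, with a purpose-specific `data`
// extension, ready for JSON serialization.
const spanMessage = {
  type: 'span',
  name: 'http.request',
  timestamp: Date.now(),
  data: { method: 'GET', url: '/users', durationMs: 12.3 },
};

const metricMessage = {
  type: 'metric',
  name: 'heap.used',
  timestamp: Date.now(),
  data: { value: 48000000, unit: 'bytes' },
};

// The bus just hands matching messages to whatever callback was
// registered; the aggregator may be in-process or out-of-process.
```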

@Qard (Member) commented Aug 5, 2020

I'm picking this back up now and working on some ideas for how to reduce or eliminate the runtime overhead. My main concern is that the prototype is basically just an event emitter, so it has to look up the subscribers on every publish. It does have a shouldPublish method to avoid constructing input data if nothing is listening, but that is functionally identical to the listeners method of an event emitter and just results in an additional lookup. An approach I've been looking at is named channel objects that can be acquired once and reused, so the lookup only happens once, at the time the channel is acquired.

I also don't think the module patches and context management aspects should be included. For core adoption, we should stick strictly to the data firehose component and strip it down to the bare essentials: something speedy and highly extensible in userland.
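A sketch of the named-channel idea described above (all names hypothetical): the name-to-channel lookup happens once, at acquisition time, so the publish hot path is just an array check:

```js
// Hypothetical sketch of channel objects acquired once and reused.
const channels = new Map();

function channel(name) {
  // The only name lookup happens here, at acquisition time.
  let ch = channels.get(name);
  if (!ch) {
    ch = { name, subscribers: [] };
    channels.set(name, ch);
  }
  return ch;
}

function publish(ch, data) {
  // Hot path: no map lookup, just a length check.
  if (ch.subscribers.length === 0) return;
  for (const fn of ch.subscribers) fn(data);
}

// A module acquires its channel once at load time...
const fsChannel = channel('fs.readFile');
// ...and publishes through the saved reference thereafter.
publish(fsChannel, { path: '/tmp/example' });
```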

@mmarchini added the diag-deepdive-agenda label Aug 5, 2020
@Qard (Member) commented Aug 9, 2020

Here's my prototyping progress this week: https://gist.github.com/Qard/b21bb8cb1be03274f4a8c585c659adf2

I tried to follow the original design for the most part, but added regex subscribers and the concept of channel and subscriber objects to take the lookup cost out-of-band from the publish, so it doesn't have to do the lookup every time. The result is publish overhead on par with a single variable increment, which is pretty good!

What do people think? Is a simple text name a flexible enough way to search for events in the firehose? Another possibility is object matching, which I did in this prototype several years back. I think it's a bit friendlier and more flexible, but I couldn't figure out a good way to pull the object-style search out-of-band from the messaging like I did for the prototype above. 🤔
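To make the trade-off concrete, a rough illustration (not the gist's actual code) of the two matching styles:

```js
// Rough illustration only -- not the gist's actual code.
// Name matching: resolvable once per channel, when the channel is
// created, so it can stay out-of-band from publishing.
function matchesName(pattern, channelName) {
  return pattern.test(channelName);
}

// Object matching: has to inspect each message, so it runs per publish --
// the part that is hard to pull out-of-band.
function matchesQuery(query, message) {
  return Object.keys(query).every((key) => message[key] === query[key]);
}

matchesName(/^fs\./, 'fs.readFile');                          // true
matchesQuery({ module: 'fs' }, { module: 'fs', op: 'read' }); // true
```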

@mmarchini (Contributor) commented:

Some questions from today's meeting:

  1. Any plans on supporting publishers in C++ (not necessarily as a public API, but as a mechanism to instrument C++ code in core)?
  2. There's publish and publishIfActive. IMO it would be better ergonomics for publish to only publish if active by default, and then have a forcePublish or something to publish even when inactive.
  3. I wonder if we should have a concept of a "parent publisher", so that publishing to a child would also publish to the parent. For example, have fs as a parent and fs.readFile as a child publisher: publishing to fs.readFile also publishes to fs (and following 2, it only publishes to active publishers by default, so if fs is active and fs.readFile is not, it would only publish to fs).
  4. How does it compare to the existing trace_events module? Could it be a feature of trace_events instead of a new module?
  5. Based on 4, and assuming we keep it separate from trace_events: do we want publisher instrumentation in the same places we have trace_events? How do we keep those in sync in that case? How do we decide what to instrument, and how do we measure the progress of instrumentation in core?
  6. Mentioned in the meeting and on Slack, but just reinforcing: I think starting with a userland module published to npm is the way to go to experiment with this, until we need something core-specific or until the API is closer to solidified.
  7. In the current implementation subscribers receive the messages synchronously. Should we support async subscribers as an option, or even by default?
    a) Any chance we can provide a channel that allows publishers and subscribers to be in different isolates (so we could have a worker thread as a subscriber, for example)?

Overall I think it's a great start for the API, and it would be good to get input from potential consumers (APMs) and publishers (frameworks) before attempting to land it in core.

@Qard (Member) commented Aug 12, 2020

  1. Not sure about C++ yet. I haven't figured out if there's anywhere specific we'd want to capture C++-side data and, if so, how we could do that in a reasonably fast way.
  2. The reason for publishIfActive(...) is to defer the input data construction so it only happens if the channel is active. It's the same as if (channel.shouldPublish()) channel.publish(data) though, so it's possibly better to tell users to use that (see the sketch after this list). At the moment it only really exists as an alternative API to compare performance and ergonomics. I do think we should settle on one way to do it before landing though.
  3. That complicates the publish logic somewhat and makes it more expensive to decide if there is work to do at runtime. It'd have to walk the tree upward for every publish, because the end of the branch might be inactive while other nodes in the graph are active. It also imposes structural opinions on the shape a channel can take from the perspective of the subscriber, whereas I'd prefer the publisher to be as simple as possible and to offload complicated/expensive logic to the subscriber, out-of-band as much as possible. Performance when not listening is a top priority here.
  4. The trace_events API is not consumable within JavaScript, so it's not really useful for production monitoring; it's more of a post-mortem analysis tool. It's also targeted more specifically at chrome://tracing. It's not really suitable for broadcasting arbitrary/unstructured data.
  5. There may be a few places where we broadcast data for both systems, but I don't see there being a lot of overlap.
  6. I disagree about making it a userland module. Most of the data that we (APM vendors) would want to gather from it is from core APIs. Starting as a userland module would just make it even harder to get it into core. (As evidenced by it taking literally half a decade for Node.js to get an async context management API despite APMs asking for it repeatedly for years. 😬)
  7. At the moment, having it sync is intentional. Timestamps can be added to messages on the consuming end that way. If we move it to async we lose the timing certainty and would likely need to do more within the API--possible, but I'm trying to avoid that unless it's really necessary. As for the worker thread idea, I've thought about it. At the moment I'm looking more at the ergonomics and concepts of the API than at specific implementation details like that.
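To make point 2 concrete, the two forms side by side. shouldPublish/publish are the prototype's names; the factory form of publishIfActive is one plausible shape for actually deferring construction, not necessarily what the gist does:

```js
// Assuming a channel like the gist's, with shouldPublish()/publish().
function buildExpensiveData() {
  return { query: 'SELECT 1', startTime: process.hrtime.bigint() };
}

// Guarded form: the data object is only constructed when active.
if (channel.shouldPublish()) {
  channel.publish(buildExpensiveData());
}

// publishIfActive: to truly defer construction it would need to accept a
// factory -- one plausible shape, not necessarily the gist's.
channel.publishIfActive(() => buildExpensiveData());
```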

There's a lot to consider in how the ecosystem would want to engage with such an API to be able to deliver something satisfying, especially given our increasing focus on the long-term vision of the project. We need to make sure this is not just relevant now, but will remain relevant and useful for the foreseeable future.

I have a few months to hack on this so there's plenty of room for iteration. We'll see how successful I am in getting this to a place where people are happy with it and it can deliver on the promise of a high-performance unified data firehose. 😅

@Flarna (Member) commented Aug 14, 2020

Sorry for not participating in the meetings, but the meeting times we have are in general hard to catch for me.

Regarding (7): I think it's vital that this is sync, to allow subscribers to correlate with their current state/context. Just getting the info that e.g. a DB query XXX was sent is not that helpful on its own. APMs correlate events to show e.g. that query X happened during the processing of request Y. As X and Y in general originate from different modules, the events themselves can't hold this information.
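A sketch of why synchronous delivery enables that correlation: the subscriber runs inside the publisher's call stack, so the APM's ambient context is still live. Here `subscribe` and `record` are hypothetical, and AsyncLocalStorage is shown only as one way an APM might carry request context:

```js
const { AsyncLocalStorage } = require('async_hooks');
const requestContext = new AsyncLocalStorage(); // APM stores { requestId } per request

// Hypothetical subscribe API. Because the callback fires synchronously
// inside the publisher's stack, getStore() still returns the context of
// the request that triggered the query.
subscribe('db.query', (message) => {
  const ctx = requestContext.getStore();
  record({ query: message.sql, requestId: ctx && ctx.requestId }); // hypothetical sink
});
```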

I'm not sure we need the regex subscribers at all. Published data is only of use if subscribers know the structure of the content for each event. Therefore I think subscribers are quite specific, and using them for a possibly unbounded set of publishers doesn't sound that reasonable. Maybe it would work for something like a log/batching subscriber which just pushes the data, as-is or stringified, on to the next consumer.

@RafaelGSS (Member) commented:

Let me share my opinions too:

  1. I agree with @Flarna about the regex.
  2. I would like to see this prototype in action, taking a database module as an example, just to validate it and see the gains with it. I would like to help you with this if you agree 😄.

And I have a question too:

  1. Do you have some idea of what kind of information from node core we would like to listen to/publish? I see this feature as a good fit for all diagnostics on Node, like CPU monitoring, memory, etc.

@mmarchini (Contributor) commented:

@Flarna good point on the regex subscriber. Could we turn the Diagnostics Channel into an EventEmitter and emit an event every time a publisher or subscriber is added or removed? That way we could replace the regex with a more flexible approach, where the user fetches all publishers, filters the list to get what they want, and listens to the added/removed events to keep the list of publishers they're subscribed to up to date. This could even be extended (possibly in userland) as a RegExSubscriber, which keeps a cache of publishers matching a regex by using the list + events.
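A sketch of that suggestion (all names hypothetical): the registry emits lifecycle events as publishers appear, and a regex subscriber is layered on top in userland using the list + events:

```js
const { EventEmitter } = require('events');

// Hypothetical registry: emits an event whenever a publisher appears.
class ChannelRegistry extends EventEmitter {
  constructor() {
    super();
    this.channels = new Map();
  }
  channel(name) {
    let ch = this.channels.get(name);
    if (!ch) {
      ch = { name, subscribers: new Set() };
      this.channels.set(name, ch);
      this.emit('publisherAdded', name);
    }
    return ch;
  }
}

// Userland RegExSubscriber built on top of the list + events.
function subscribeRegex(registry, pattern, onMessage) {
  const attach = (name) => {
    if (pattern.test(name)) registry.channel(name).subscribers.add(onMessage);
  };
  for (const name of registry.channels.keys()) attach(name); // existing publishers
  registry.on('publisherAdded', attach);                     // future publishers
}

// Usage: the subscriber picks up fs.readFile when it is created later.
const registry = new ChannelRegistry();
subscribeRegex(registry, /^fs\./, (msg) => console.log(msg));
registry.channel('fs.readFile');
```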

@Qard (Member) commented Aug 14, 2020

Yes, that's basically what I have locally, though it's not done yet. I'm replacing the regex-specific subscriber concept with a generalized object that has a function like shouldSubscribe(channelName), which does whatever it needs to do with the channel name and returns true or false to indicate whether it should listen, plus an onMessage(message) to receive the messages. The regex-matching subscriber can easily be implemented on top of that.
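That generalized subscriber shape might look like the following; the surrounding `registerSubscriber` plumbing is hypothetical:

```js
// The generalized subscriber object: shouldSubscribe() is consulted once
// per channel, out-of-band; onMessage() is the per-publish hot path.
const regexSubscriber = (pattern, handler) => ({
  shouldSubscribe(channelName) {
    return pattern.test(channelName);
  },
  onMessage(message) {
    handler(message);
  },
});

// Hypothetical registration call into the channel system.
registerSubscriber(regexSubscriber(/^fs\./, (msg) => console.log(msg)));
```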

As for the need for sync: yes, I mostly agree with that assessment. There are performance implications to it though, since it moves the subscriber execution into the direct execution path. It also might be impossible to stay sync if we consider possibilities like subscribing from a worker thread, so context management may need to get baked in if we go that route.

This is all very early in the experimentation process--I've only been working on this for a week and a bit, so I'm still experimenting with ideas and getting a feel for what the possibilities are and what the ultimate implementation could look like. The code as it is now is intended only as a proof-of-concept of API ergonomics and as a way to test the approximate performance impact of injecting such a thing into core code paths. The code is likely to evolve considerably before we actually consider it for core.

@Flarna (Member) commented Aug 14, 2020

I'm curious why async is seen as more performant than sync in this case.
Async implies that some task needs to be created and enqueued, in addition to creating the data object to publish. In the case of workers, the data must additionally be serialized.
I would assume that subscribers either do very little work in the callback or trigger an async task as needed.

@Qard (Member) commented Aug 14, 2020

I'm a bit uncertain too, but I've had comments suggesting moving APM agents into worker threads to take their overhead out of the main application, which seems plausible in theory, anyway. I'm personally a bit skeptical that it would be better though. 🤔

@Flarna (Member) commented Aug 14, 2020

Collecting the data must happen on the main thread. What can be moved is the serialization/compression/correlation of the collected data and its transmission to the APM server. But that is something for APM tools to work on, not the diagnostics channel.
The worst case would be publishing data to a worker whose subscriber just forwards it back to the main thread...
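One way an APM could follow that split, keeping the main-thread subscriber thin and handing the heavy lifting to a worker (`subscribe` is the hypothetical channel API discussed above, and the worker script name is made up):

```js
const { Worker } = require('worker_threads');

// The worker does the batching/serialization/transmission off-thread.
const apmWorker = new Worker('./apm-worker.js'); // hypothetical worker script

// Main thread: collect synchronously, do minimal work, hand off.
subscribe('http.server.request', (message) => {
  apmWorker.postMessage({ ts: Date.now(), url: message.url });
});
```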

@Qard (Member) commented Aug 14, 2020

Well, ideally, if the diagnostics channel does what it's intended to do, the need for APM code to run on the main thread should mostly disappear. Context management is the one challenge there though, which leads me to think that might not be entirely feasible. If at all possible, I'd love to be able to fully isolate all APM code to a separate thread, but I'm not convinced that's doable without some fairly major changes to Node.js.

@Flarna (Member) commented Aug 14, 2020

OK, so the main thread publishes data, either via sensors placed by the APM tool or by modules having this built in. This data is collected in a worker, processed, and transmitted.
Sounds fine so far.
Thinking of a generic DiagnosticsChannel, I still question the async publish to a worker, as one tool may run in one worker and another tool in a different one. So somehow the publish target needs to be part of the subscriber.

@mmarchini (Contributor) commented:

FWIW, "Timestamps can be added to messages on the consuming end that way" was enough to convince me that sync in the main thread is the best approach for the API we expose (I was thinking of timestamps coming from the publisher, but I guess they make more sense on the subscriber) :)

@Qard (Member) commented Aug 14, 2020

Yeah, my intuition is that it's better to just keep it in the main thread and encourage APMs themselves to dispatch to their own threaded thing. Timestamps are an easy fix with a wrapper object, but context management gets harder, especially with multi-tenancy.

@Qard (Member) commented Aug 23, 2020

Pull request opened! 🎉 nodejs/node#34895

@github-actions (bot) commented:

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions bot added the stale label Nov 22, 2020
@mhdawson (Member, Author) commented:

Removing stale as @Qard is working on it.

@github-actions (bot) commented:

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
