Performance profiles for typical Node.js applications #161
Agree it is useful for end users to be able to reason about perf characteristics. The challenge with what you're suggesting is defining what a "typical" application is, and then what a "typical" host environment is. E.g., running on high-end dedicated hardware will produce one set of throughput values, while running the same app on a public cloud where you're randomly co-located with "noisy neighbors" will produce an entirely different set of results. A solid outcome (IMO, feel free to disagree :) would be a set of applications that represent different workloads, plus prescriptive instructions for how end users can consistently replicate and share the results (i.e., a fixed set of measures and a fixed-format report). This would enable more consistent comparisons, including comparisons across different host environments. Agree this overlaps with the benchmarking WG.
@mike-kaufman I'm more interested in having a sampling of real-world code than more "standard" benchmarks. The basic question that is often asked is "is my application as fast as it can be" (within reason)? Am I doing something wrong? I don't think having standard benchmarks would solve these scenarios.
Not sure I'm clear on the difference between "a sampling of real-world code" and "a workload benchmark" (e.g. Acme Air). My thinking is you can define some canonical apps that represent certain workloads (e.g., a DB-backed CRUD app, a socket proxy app, ...), and then these can be used to drive meaningful comparisons (host A vs host B, my CRUD app vs the canonical CRUD app, Node 9 vs Node 10, ...). If you have an idea or some code that represents what you have in mind, that might help me better understand your point.
I think what we are looking for are:
The main difference between this and the benchmarking WG's effort is that we are looking for more real-world, in-production data, without actually getting the code. Therefore, we can obtain those statistics, say, using surveys with appropriate questions. This in fact looks a bit similar to the Foundation's user survey questions, but goes into more detail and looks for numbers.
Another way to get that data is to ask the APM vendors to help us collect it, after getting consent from their users. In return they get the appropriate performance profiles to pattern-match against, so they can alert users when their applications underperform, and even give advice based on the data.
Another way to use this data/report: there are many users coming to the nodejs/node issue tracker with all kinds of memory/CPU usage graphs from their APM providers, asking questions about the performance of their applications. Without actual code, those issues are not really actionable and often get closed due to inactivity. With this kind of data in place, we could redirect them there instead of giving vague answers, and this would help us triage the actual performance issues.
I would not rely on APMs here. I'd also rather opt for a standardized set of key metrics that can be collected from a running process and then sent to a third party for further inspection. Even better would be:
Doing so would allow us to source comparable real-world performance information for many different configurations. It would also be possible to integrate this with APM via events.
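As a rough sketch of what such a standardized set could look like, here is a snapshot built only from Node core APIs (Node.js >= 11.10 for `monitorEventLoopDelay`); the field names, the collection interval, and the idea of printing JSON for a third-party collector are assumptions for illustration, not an agreed standard:

```js
// Hypothetical "standard metrics snapshot" built only from Node core APIs.
const { monitorEventLoopDelay } = require('perf_hooks');

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

function collectSnapshot() {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage();
  return {
    timestamp: Date.now(),
    nodeVersion: process.version,
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    externalBytes: mem.external,
    cpuUserMicros: cpu.user,
    cpuSystemMicros: cpu.system,
    eventLoopDelayP99Ms: loopDelay.percentile(99) / 1e6,
  };
}

// Emit one snapshot per minute; a real setup would send this to a collector.
setInterval(() => {
  console.log(JSON.stringify(collectSnapshot()));
  loopDelay.reset();
}, 60000).unref();
```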
It's definitely hard to come up with an exact X for "any CRUD web service". At the same time a new service doesn't have a baseline but might be interested in how it stacks up. The thing that APM providers might know though is "what is the general profile of [a class of] node apps". E.g. what's the throughput/latency distribution/memory usage of web services that make a few database calls? Getting those numbers for one or two services will be a poor representation of "normal". Getting those numbers for hundreds or thousands of services and looking at distributions might be more telling.
So, entirely possible I'm not understanding what's being proposed here. That said, I feel like the goal (having a set of KPIs that can be used to gauge perf of their app) lacks any controls. It's analogous to saying "water boils in n seconds" w/out controlling for initial temp of water, BTU output of stove, volume of water or altitude.
This is an interesting way to frame this, and aligns with the trace-macro efforts and the efforts to eliminate monkey patching.
Fully agreed.
Agreed - I can think of a combined metric that consists of
Plus metadata like Node version, etc. If we are able to collect that for a range of servers, regressions between versions would be detectable.
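A rough illustration of how attaching that metadata makes version regressions detectable once reports from many servers are pooled; the report shape (a `nodeVersion` field plus a numeric metric field) is assumed, not an agreed format:

```js
// Group hypothetical metric reports by Node version and compare medians,
// e.g. of p99 latency, to spot regressions between versions.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

function medianByNodeVersion(reports, metricKey) {
  const byVersion = new Map();
  for (const report of reports) {
    const group = byVersion.get(report.nodeVersion) || [];
    group.push(report[metricKey]);
    byVersion.set(report.nodeVersion, group);
  }
  const result = {};
  for (const [version, values] of byVersion) {
    result[version] = median(values);
  }
  // e.g. { 'v10.16.0': 42, 'v12.4.0': 55 } would suggest a regression worth
  // investigating (or a workload/host difference that needs controlling for).
  return result;
}
```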
Yes, I agree. We need additional input for measurement. What I have in mind as the end result is something like a "performance calculator", or "cost calculator" - the users can give us:
We can give them in return:
So, ideally, users could know whether their existing applications are underperforming or costing more than they should by providing that input, and could have a better idea of the kind of cloud resources/host they need before they put something into production. I think our first goal is to figure out what kind of input we need, and what kind of output we can give in return.
That sounds good, but I wonder how to source the raw data to provide such insights.
I'm +1 on the idea of having a standard set of "runtime perf metrics" that can come out of Node. I think this is the first step to what's being suggested, and it overlaps nicely with the goal of having more runtime diagnostic info output through trace macros. If the scope here is the standard metrics, there's overlap with #131. It would be nice if we can come to some agreement on this and close/consolidate issues as appropriate. Re: Joyee's suggestion about the "cost calculator", I love the idea in the abstract. My worry is there are just too many variables at play to land it successfully. Ideally, we want to be able to surface easy-to-understand perf data in easy-to-use tooling so that "mere mortal" developers can get a sense of how their app is performing and what options they have to improve it.
Having a set of "runtime perf metrics" that are easy to extract and process would be fantastic. It's a great starting point. This can also live in the community/ecosystem, something that you either enable with […]
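However it ends up being switched on, here is a minimal sketch of such an opt-in collector, gated by a hypothetical NODE_RUNTIME_METRICS environment variable and using only perf_hooks from core:

```js
// Opt-in GC overhead collector; NODE_RUNTIME_METRICS is a hypothetical switch.
'use strict';
const { PerformanceObserver } = require('perf_hooks');

if (process.env.NODE_RUNTIME_METRICS) {
  let gcCount = 0;
  let gcTotalMs = 0;

  const obs = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      gcCount += 1;
      gcTotalMs += entry.duration;
    }
  });
  obs.observe({ entryTypes: ['gc'] });

  // Periodically report GC overhead; a real module would expose this
  // programmatically instead of logging it.
  setInterval(() => {
    console.log(`gc: ${gcCount} pauses, ${gcTotalMs.toFixed(1)} ms total`);
    gcCount = 0;
    gcTotalMs = 0;
  }, 30000).unref();
}
```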
I think this has some overlap with the tracing library. E.g., for ETW or DTrace, you would just need to attach a listener and you'd get the events - you wouldn't need a special build or special switches at startup. The challenge here, I think, is a model that works across different trace libraries. See the table here for some more detail.
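For reference, Node's built-in trace_events module already offers something in this direction; a minimal sketch (the categories shown are documented ones, and output currently goes to a node_trace.*.log file rather than to an in-process listener):

```js
// Enable built-in trace events at runtime (Node.js >= 10).
const trace_events = require('trace_events');

const tracing = trace_events.createTracing({
  categories: ['node.perf', 'v8'],
});
tracing.enable();

// ... run the workload of interest; events are written to node_trace.*.log ...

// Stop collecting when done:
// tracing.disable();
```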
Bumping this since some things have changed since early 2018, especially the launch of OpenTelemetry, which works as a "standard" way to gather metrics/traces from an application and send them to any backend. What do you think?
I agree that working in the context of OpenTelemetry makes sense. I've done some work with OpenCensus and I think it's a good starting point for us to think about when talking about how to deliver base metrics.
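As a sketch of how a runtime metric could be surfaced through OpenTelemetry, assuming a metrics SDK and exporter are configured elsewhere; the API shown is from recent @opentelemetry/api releases and the metric name and attributes are made up for illustration:

```js
// Report p99 event-loop delay as an OpenTelemetry observable gauge.
// Assumes an OpenTelemetry metrics SDK + exporter is registered elsewhere.
const { metrics } = require('@opentelemetry/api');
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

const meter = metrics.getMeter('nodejs-runtime');
const gauge = meter.createObservableGauge('nodejs.event_loop_delay.p99', {
  unit: 'ms',
  description: 'p99 event loop delay since the last collection',
});

gauge.addCallback((result) => {
  result.observe(histogram.percentile(99) / 1e6, {
    'node.version': process.version,
  });
  histogram.reset();
});
```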
@hekike FYI as well, since I know you have contributed some changes to OpenCensus too.
Yes, at the moment we are experimenting with using OpenCensus at Netflix for our Node platform. I would be interested in such a runtime metrics exporter, especially now that OpenTelemetry is happening. I could imagine an official solution provided by Node core to instrument core modules like HTTP and extract performance metrics, instead of OpenCensus's current monkey-patching approach. (Low-cost, stable built-in context propagation would also be awesome.) About the original question, I can ask around about what data we can share with the community from our use cases. We run various API workloads with technologies like GraphQL, gRPC, React rendering, etc., at both Netflix scale and smaller internal/partner scale.
I think we should distinguish here between metrics and tracing. Adding metrics as described above to Node core should be quite easy, as reporting one metric has no side effects on the others and there is no need to have "world-wide" context passing in place. Traces/transactions are a lot harder, as they require working context passing, which in turn depends on user modules.
@Flarna the context here is that OpenTelemetry can extract both timeseries and tracing data from a process and report it to various backends. On your statement that timeseries data is always transaction-independent, I slightly disagree. For example, for downstream RPCs it can be useful to break down metrics by down/upstream dependencies. Here are some examples:
So 👍 on separation, but I do expect in the future these topics will bleed into each other.
Fully agree here. My main point was that there are a lot of interesting but transaction-independent metrics out there which could be reported right now, without waiting for the long-running topic of context passing in Node.js to be resolved (which may never happen at all).
As discussed in nodejs/diagnostics#161, the core should expose important metrics about the runtime. This PR's goal is to let users get the number of I/O requests made, as well as lower-level metrics like page faults and context switches.
PR-URL: #28018 Reviewed-By: Ben Noordhuis <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Anna Henningsen <[email protected]>
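This appears to be the work that landed as process.resourceUsage() in current Node.js releases (>= 12.6.0); a short usage sketch of that API:

```js
// process.resourceUsage() surfaces libuv's uv_getrusage() counters, including
// I/O operations, page faults and context switches for the current process.
const usage = process.resourceUsage();

console.log({
  fsRead: usage.fsRead,                 // blocking input operations
  fsWrite: usage.fsWrite,               // blocking output operations
  minorPageFault: usage.minorPageFault,
  majorPageFault: usage.majorPageFault,
  voluntaryContextSwitches: usage.voluntaryContextSwitches,
  involuntaryContextSwitches: usage.involuntaryContextSwitches,
  maxRSS: usage.maxRSS,                 // maximum resident set size, in KB
});
```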
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
During the breakout session on day 1 of this week's diagnostics summit, we discussed providing performance profiles for typical Node.js applications, e.g. CPU & memory usage, GC overhead, throughput, latency, etc., so that users could have a clear mental picture of how their applications perform and whether there is room for optimization.
This seems to overlap somewhat with the work of the benchmarking working group, since they are looking for real-world workloads as well. In our case, though, we are not looking for an open code base; rather, we need average statistics about typical Node.js applications.
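To make the kind of numbers concrete, a small sketch of how throughput and latency could be sampled from a plain Node.js HTTP server; the reporting interval and the trivial handler are placeholders for a real application:

```js
// Sample throughput and p99 latency for an HTTP server; these are the kind of
// numbers a "typical application" profile would aggregate.
const http = require('http');

let requests = 0;
const latenciesMs = [];

const server = http.createServer((req, res) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    requests += 1;
    latenciesMs.push(Number(process.hrtime.bigint() - start) / 1e6);
  });
  // Placeholder handler; a real app would hit a database, render views, etc.
  res.end('ok');
});

setInterval(() => {
  latenciesMs.sort((a, b) => a - b);
  const p99 = latenciesMs[Math.floor(latenciesMs.length * 0.99)] || 0;
  console.log(`throughput: ${(requests / 10).toFixed(1)} req/s, p99: ${p99.toFixed(2)} ms`);
  requests = 0;
  latenciesMs.length = 0;
}, 10000).unref();

server.listen(3000);
```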