Performance profiles for typical Node.js applications #161
Agree it is useful for end users to be able to reason about perf characteristics. The challenge with what you're suggesting is defining what a "typical" application is, and then what a "typical" host environment is. E.g., running on high-end dedicated hardware will produce one set of throughput values, while running the same app on a public cloud where you're randomly co-located with "noisy neighbors" will produce an entirely different set of results. A solid outcome (IMO, feel free to disagree :) would be a set of applications that represent different workloads, plus prescriptive instructions for how end users can consistently replicate and share the results (i.e., a fixed set of measures and a fixed-format report). This would enable more consistent comparisons, including comparisons across different host environments. Agree this overlaps with the benchmarking WG.
@mike-kaufman I'm more interested in having a sampling of real-world code than more "standard" benchmarks. The basic question that is often asked is "is my application as fast as it can be" (within reason)? Am I doing something wrong? I don't think having standard benchmarks would solve these scenarios.
Not sure I'm clear on the difference between "a sampling of real-world code" and "a workload benchmark" (e.g. Acme Air). My thinking is you can define some canonical apps that represent certain workloads (e.g., a DB-backed CRUD app, a socket proxy app, ...), and then these can be used to drive meaningful comparisons (host A vs host B, my CRUD app vs the canonical CRUD app, Node 9 vs Node 10, ...). If you have an idea or some code that represents what you have in mind, that might help me better understand your point.
I think what we are looking for are:
The main difference between this and the benchmarking WG's effort is that we are looking for more real-world, in-production data, without actually getting the code. Therefore, we can obtain those statistics, say, using surveys with appropriate questions. This in fact looks a bit similar to the Foundation's user survey questions, but goes into more detail and looks for numbers.
Another way to get that data is to ask the APM vendors to help us collect it, after getting consent from their users. In return they get the appropriate performance profiles to pattern-match against, so they can alert users when their applications underperform, and even give advice based on the data.
Another way to use this data/report: there are many users coming to the nodejs/node issue tracker with all kinds of memory/CPU usage graphs from their APM providers, asking questions about the performance of their applications. Without actual code, those issues are not really actionable and often get closed due to inactivity. With this kind of data in place, we could redirect them there instead of giving vague answers, and this would help us triage the actual performance issues.
I would not rely on APMs here. I'd also rather opt for a standardized set of key metrics that can be collected from a running process and then sent to a third party for further inspection. Even better would be:
Doing so would allow us to source comparable real-world performance information for many different configurations. It would also be possible to integrate this with APM via events.
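As a rough sketch of what such a standardized set could look like, here is a snapshot built only from Node core APIs (Node.js >= 11.10 for `monitorEventLoopDelay`); the field names, the collection interval, and the idea of printing JSON for a third-party collector are assumptions for illustration, not an agreed standard:

```js
// Hypothetical "standard metrics snapshot" built only from Node core APIs.
const { monitorEventLoopDelay } = require('perf_hooks');

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

function collectSnapshot() {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage();
  return {
    timestamp: Date.now(),
    nodeVersion: process.version,
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    externalBytes: mem.external,
    cpuUserMicros: cpu.user,
    cpuSystemMicros: cpu.system,
    eventLoopDelayP99Ms: loopDelay.percentile(99) / 1e6,
  };
}

// Emit one snapshot per minute; a real setup would send this to a collector.
setInterval(() => {
  console.log(JSON.stringify(collectSnapshot()));
  loopDelay.reset();
}, 60000).unref();
```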
It's definitely hard to come up with an exact X for "any CRUD web service". At the same time a new service doesn't have a baseline but might be interested in how it stacks up. The thing that APM providers might know though is "what is the general profile of [a class of] node apps". E.g. what's the throughput/latency distribution/memory usage of web services that make a few database calls? Getting those numbers for one or two services will be a poor representation of "normal". Getting those numbers for hundreds or thousands of services and looking at distributions might be more telling.
So, entirely possible I'm not understanding what's being proposed here. That said, I feel like the goal (having a set of KPIs that can be used to gauge perf of their app) lacks any controls. It's analogous to saying "water boils in n seconds" w/out controlling for initial temp of water, BTU output of stove, volume of water or altitude.
This is an interesting way to frame this, and aligns with the trace-macro efforts and the efforts to eliminate monkey patching.
Fully agreed.
Agreed - I can think of a combined metric that consists of
Plus metadata like Node version, etc. If we are able to collect that for a range of servers, regressions between versions would be detectable.
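A rough illustration of how attaching that metadata makes version regressions detectable once reports from many servers are pooled; the report shape (a `nodeVersion` field plus a numeric metric field) is assumed, not an agreed format:

```js
// Group hypothetical metric reports by Node version and compare medians,
// e.g. of p99 latency, to spot regressions between versions.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

function medianByNodeVersion(reports, metricKey) {
  const byVersion = new Map();
  for (const report of reports) {
    const group = byVersion.get(report.nodeVersion) || [];
    group.push(report[metricKey]);
    byVersion.set(report.nodeVersion, group);
  }
  const result = {};
  for (const [version, values] of byVersion) {
    result[version] = median(values);
  }
  // e.g. { 'v10.16.0': 42, 'v12.4.0': 55 } would suggest a regression worth
  // investigating (or a workload/host difference that needs controlling for).
  return result;
}
```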
Yes, I agree. We need additional input for measurement. What I have in mind as the end result is something like a "performance calculator", or "cost calculator" - the users can give us:
We can give them in return:
So, ideally, users could know whether their existing applications are underperforming or costing more than they should by providing that input, and could have a better idea of the kind of cloud resources/host they need before they put something into production. I think our first goal is to figure out what kind of input we need, and what kind of output we can give in return.
That sounds good, but I wonder how to source the raw data to provide such insights.
I'm +1 on the idea of having a standard set of "runtime perf metrics" that can come out of Node. I think this is the first step to what's being suggested, and it overlaps nicely with the goal of having more runtime diagnostic info output through trace macros. If the scope here is the standard metrics, there's overlap with #131. It would be nice if we can come to some agreement on this and close/consolidate issues as appropriate. Re: Joyee's suggestion about the "cost calculator", I love the idea in the abstract. My worry is there are just too many variables at play to land it successfully. Ideally, we want to be able to surface easy-to-understand perf data in easy-to-use tooling so that "mere mortal" developers can get a sense of how their app is performing and what options they have to improve it.
Having a set of "runtime perf metrics" that are easy to extract and process would be fantastic. It's a great starting point. This can also live in the community/ecosystem, something that you either enable with […]
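However it ends up being switched on, here is a minimal sketch of such an opt-in collector, gated by a hypothetical NODE_RUNTIME_METRICS environment variable and using only perf_hooks from core:

```js
// Opt-in GC overhead collector; NODE_RUNTIME_METRICS is a hypothetical switch.
'use strict';
const { PerformanceObserver } = require('perf_hooks');

if (process.env.NODE_RUNTIME_METRICS) {
  let gcCount = 0;
  let gcTotalMs = 0;

  const obs = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      gcCount += 1;
      gcTotalMs += entry.duration;
    }
  });
  obs.observe({ entryTypes: ['gc'] });

  // Periodically report GC overhead; a real module would expose this
  // programmatically instead of logging it.
  setInterval(() => {
    console.log(`gc: ${gcCount} pauses, ${gcTotalMs.toFixed(1)} ms total`);
    gcCount = 0;
    gcTotalMs = 0;
  }, 30000).unref();
}
```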
I think this has some overlap with the tracing library. E.g., for ETW or DTrace, you would just need to attach a listener and you'd get the events - you wouldn't need a special build or special switches at startup. The challenge here, I think, is a model that works across different trace libraries. See the table here for some more detail.
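For reference, Node's built-in trace_events module already offers something in this direction; a minimal sketch (the categories shown are documented ones, and output currently goes to a node_trace.*.log file rather than to an in-process listener):

```js
// Enable built-in trace events at runtime (Node.js >= 10).
const trace_events = require('trace_events');

const tracing = trace_events.createTracing({
  categories: ['node.perf', 'v8'],
});
tracing.enable();

// ... run the workload of interest; events are written to node_trace.*.log ...

// Stop collecting when done:
// tracing.disable();
```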
Bumping this since some things have changed since early 2018, especially the launch of OpenTelemetry, which works as a "standard" way to gather metrics/traces from an application and send them to any backend. What do you think?
I agree that working in the context of OpenTelemetry makes sense. I've done some work with OpenCensus and I think it's a good starting point for us to think about when talking about how to deliver base metrics.
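As a sketch of how a runtime metric could be surfaced through OpenTelemetry, assuming a metrics SDK and exporter are configured elsewhere; the API shown is from recent @opentelemetry/api releases and the metric name and attributes are made up for illustration:

```js
// Report p99 event-loop delay as an OpenTelemetry observable gauge.
// Assumes an OpenTelemetry metrics SDK + exporter is registered elsewhere.
const { metrics } = require('@opentelemetry/api');
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

const meter = metrics.getMeter('nodejs-runtime');
const gauge = meter.createObservableGauge('nodejs.event_loop_delay.p99', {
  unit: 'ms',
  description: 'p99 event loop delay since the last collection',
});

gauge.addCallback((result) => {
  result.observe(histogram.percentile(99) / 1e6, {
    'node.version': process.version,
  });
  histogram.reset();
});
```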
@hekike FYI as well, since I know you have contributed some changes to OpenCensus too.
Yes, at the moment we are experimenting with using OpenCensus at Netflix for our Node platform. I would be interested in such a runtime metrics exporter, especially now that OpenTelemetry is happening. I could imagine an official solution provided by Node core to instrument core modules like HTTP and extract performance metrics, instead of OpenCensus's current monkey-patching approach. (Low-cost, stable built-in context propagation would also be awesome.) About the original question, I can ask around about what data we can share with the community from our use cases. We run various API workloads with technologies like GraphQL, gRPC, React rendering, etc., at both Netflix scale and smaller internal/partner scale.
I think we should distinguish here between metrics and tracing. Adding metrics as described above to Node core should be quite easy, as reporting one metric has no side effects on the others and there is no need to have "world-wide" context passing in place. Traces/transactions are a lot harder, as they require working context passing, which in turn depends on user modules.
@Flarna the context here is that OpenTelemetry can extract both timeseries and tracing data from a process and report it to various backends. On your statement that timeseries data is always transaction-independent, I slightly disagree. For example, for downstream RPCs it can be useful to break down metrics by down/upstream dependencies. Here are some examples:
So 👍 on separation, but I do expect in the future these topics will bleed into each other.
Fully agree here. My main point was that there are a lot of interesting but transaction-independent metrics out there which could be reported right now, without waiting for the long-running topic of context passing in Node.js to be resolved (which may never happen at all).
As discussed in nodejs/diagnostics#161, the core should expose important metrics about the runtime. This PR's goal is to let users get the number of I/O requests made, as well as lower-level metrics like page faults and context switches.
PR-URL: #28018 Reviewed-By: Ben Noordhuis <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Anna Henningsen <[email protected]>
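This appears to be the work that landed as process.resourceUsage() in current Node.js releases (>= 12.6.0); a short usage sketch of that API:

```js
// process.resourceUsage() surfaces libuv's uv_getrusage() counters, including
// I/O operations, page faults and context switches for the current process.
const usage = process.resourceUsage();

console.log({
  fsRead: usage.fsRead,                 // blocking input operations
  fsWrite: usage.fsWrite,               // blocking output operations
  minorPageFault: usage.minorPageFault,
  majorPageFault: usage.majorPageFault,
  voluntaryContextSwitches: usage.voluntaryContextSwitches,
  involuntaryContextSwitches: usage.involuntaryContextSwitches,
  maxRSS: usage.maxRSS,                 // maximum resident set size, in KB
});
```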
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
During the breakout session on day 1 of this week's diagnostics summit, we discussed providing performance profiles for typical Node.js applications, e.g. CPU & memory usage, GC overhead, throughput, latency, etc., so that users could have a clear mental picture of how their applications perform and whether there is room for optimization.
This seems to overlap somewhat with the work of the benchmarking working group, since they are looking for real-world workloads as well. In our case, though, we are not looking for an open code base; rather, we need average statistics about typical Node.js applications.
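To make the kind of numbers concrete, a small sketch of how throughput and latency could be sampled from a plain Node.js HTTP server; the reporting interval and the trivial handler are placeholders for a real application:

```js
// Sample throughput and p99 latency for an HTTP server; these are the kind of
// numbers a "typical application" profile would aggregate.
const http = require('http');

let requests = 0;
const latenciesMs = [];

const server = http.createServer((req, res) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    requests += 1;
    latenciesMs.push(Number(process.hrtime.bigint() - start) / 1e6);
  });
  // Placeholder handler; a real app would hit a database, render views, etc.
  res.end('ok');
});

setInterval(() => {
  latenciesMs.sort((a, b) => a - b);
  const p99 = latenciesMs[Math.floor(latenciesMs.length * 0.99)] || 0;
  console.log(`throughput: ${(requests / 10).toFixed(1)} req/s, p99: ${p99.toFixed(2)} ms`);
  requests = 0;
  latenciesMs.length = 0;
}, 10000).unref();

server.listen(3000);
```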