Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LiveMetrics] report process metrics CPU Total and Committed Memory #42213

Merged
merged 13 commits into from
Mar 5, 2024

Conversation

TimothyMothra
Copy link
Contributor

@TimothyMothra TimothyMothra commented Feb 27, 2024

Changes

Testing

I've collected screenshots and detailed description for each counter.
For these metrics, I'm trying to match what Visual Studio's Debugger shows.

Memory (click to expand)

To test I debugged an app in Visual Studio and compared to the Process Memory.

Visual Studio Debugger

This shows roughly 50 MB of Process Memory
Screenshot 2024-02-23 140056

OTel uses:

This gives a value almost double of what Visual Studio Debugger is showing:
Screenshot 2024-02-26 183506

AI SDK uses:

This value closely matches Visual Studio's Diagnostic Tools:
Screenshot 2024-02-23 140153

CPU (click to expand)

To test CPU I compute squareroots in a loop for 20 seconds.

Visual Studio's Debugger:

Here, CPU jumps to 100% and stays high for a period before dropping back to 0%.
Screenshot 2024-02-23 143342

Otel uses

Note that this grows and never resets to zero. That flat line at the end is when the compute finished.
Screenshot 2024-02-23 143727

AI SDK uses

TotalProcessorTime is defined as the sum of UserProcessorTime and PrivilegedProcessorTime. So this is the same metrics used by OTel. But AI SDK has a specific algorithm to calculate the difference.

Note that this more closely matches Visual Studio
Screenshot 2024-02-23 144203

@github-actions github-actions bot added Monitor - Distro Monitor OpenTelemetry Distro Monitor - LiveMetrics labels Feb 27, 2024
@azure-sdk
Copy link
Collaborator

API change check

API changes are not detected in this pull request.

@TimothyMothra TimothyMothra marked this pull request as ready for review February 28, 2024 22:55
@rajkumar-rangaraj

This comment was marked as resolved.

@TimothyMothra TimothyMothra changed the title [LiveMetrics] report process counters for perf counters [LiveMetrics] report process counters CPU Total and Committed Memory Feb 29, 2024
@TimothyMothra TimothyMothra changed the title [LiveMetrics] report process counters CPU Total and Committed Memory [LiveMetrics] report process metrics CPU Total and Committed Memory Feb 29, 2024
@TimothyMothra
Copy link
Contributor Author

@cijothomas @rajkumar-rangaraj
I've addressed all feedback, please re-review.
I'll be out the rest of the week. If everything looks good feel free to merge. :)

@cijothomas
Copy link
Contributor

The screenshot in the description has one with CPU showing above 100%.. That should never happen, if you are sending normalized values.

@reyang
Copy link
Member

reyang commented Feb 29, 2024

The screenshot in the description has one with CPU showing above 100%.. That should never happen, if you are sending normalized values.

I suggest that we evaluate what .NET Extension did and align on the algorithm. At minimum, if we want to introduce our own implementation, I want the algorithm to be carefully designed, reviewed (and backed by Math) and documented (e.g. this PR sets a good example). https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.Diagnostics.ResourceMonitoring/Linux/LinuxUtilizationProvider.cs

@cijothomas
Copy link
Contributor

The screenshot in the description has one with CPU showing above 100%.. That should never happen, if you are sending normalized values.

I suggest that we evaluate what .NET Extension did and align on the algorithm. At minimum, if we want to introduce our own implementation, I want the algorithm to be carefully designed, reviewed (and backed by Math) and documented (e.g. this PR sets a good example). https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.Diagnostics.ResourceMonitoring/Linux/LinuxUtilizationProvider.cs

Agree mostly. My thoughts are

  1. AzMon distro or exporter should not be in the business of producing metrics. It should just bring in existing instrumentation libraries only. And so far, we seem to be sticking with this approach.
  2. This particular PR is for the LiveMetrics feature. While it'd be nice to follow above here too, it may not be feasible/worth-the-effort. Considering LiveMetrics feature is not to store metrics anywhere permanent, the efforts to design and maintain an algorithm may not be worth it. (Its just for quick visual indication, not even queryable for later consumption, so carefully find the balance and avoid spending too much effort here)
  3. We still need to do the basic accuracy part - i.e do not incorrectly tell what the CPU utilization is that of the machine, while actually showing just the single process cpu utilization. Do not state that the cpu utilization is normalized to take into account the number of cores, but send a non-normalized value. (these requires product changes too). (In fact, ApplicationInsights SDKs are lying in many cases - eg: In Linux or in AppService environmetns, app insights only sends process cpu and memory, but the UI shows it as Total Machine CPU and Memory, which is wrong, and that mistake should not be repeated here.) Again, this needs product change/support (And Mothra is already engaged with Product team to get the necessary changes)

@rajkumar-rangaraj
Copy link
Contributor

Providing a summary of the issue before we exploring into the specifics:

In the Application Insights SDK, for the Windows environment, CPU metrics are collected for the entire operating system. In contrast, for Azure App Service and other non-Windows environments, process-level CPU metrics are utilized. When it comes to memory metrics, Windows uses the performance counter \Memory\Committed Bytes, while other environments rely on process-level private bytes.

Languages other than .NET typically read the OS-level counter to send these metrics, leading to inconsistencies in the data sent across different environments and languages.

Mothra has initiated a collaboration with the services team to establish clear guidelines on what exactly should be sent. Our objective is to ensure that the same metrics are sent from all environments and languages.

The services team has requested additional time to formalize this process. As an interim solution, both service and SDK teams have agreed to use process-level metrics for both CPU and memory. This temporary solution will be implemented in beta versions, and we will not release a Release Candidate (RC) or General Availability (GA) version until the standardization is finalized. This approach allows us to continue our work on filters in the meantime.

AzMon distro or exporter should not be in the business of producing metrics. It should just bring in existing instrumentation libraries only. And so far, we seem to be sticking with this approach.

This is correct, except for standard metrics. For live metrics, we don't want to take the dependency on the OpenTelemetry Metric SDK, as it will make the implementation very complex.

This particular PR is for the LiveMetrics feature. While it'd be nice to follow above here too, it may not be feasible/worth-the-effort.

Correct, it is not worth the effort.

Considering LiveMetrics feature is not to store metrics anywhere permanent, the efforts to design and maintain an algorithm may not be worth it. (Its just for quick visual indication, not even queryable for later consumption, so carefully find the balance and avoid spending too much effort here)

I have to agree with Cijo on this part. The algorithm used is very straightforward, and I don't see any issues with it. The LinuxUtilizationProvider uses the same approach..

The screenshot in the description has one with CPU showing above 100%.. That should never happen, if you are sending normalized values.

This is not true. Mothra is trying to explain that using Process.UserProcessorTime plots an incorrect graph, and a small implementation is needed to solve it. He provided the accurate graph at the end of the description and has also validated the graph by onboarding the runtime metrics to the same app.

Finally, I need to acknowledge that we need to accurately capture the information in the changelogs.

@cijothomas
Copy link
Contributor

The screenshot in the description has one with CPU showing above 100%.. That should never happen, if you are sending normalized values.

This is not true. Mothra is trying to explain that using Process.UserProcessorTime plots an incorrect graph, and a small implementation is needed to solve it. He provided the accurate graph at the end of the description and has also validated the graph by onboarding the runtime metrics to the same app.

Thanks! Yes I can see the description is pretty clear.

@TimothyMothra TimothyMothra enabled auto-merge (squash) March 5, 2024 00:57
@TimothyMothra TimothyMothra merged commit 6948051 into main Mar 5, 2024
15 checks passed
@TimothyMothra TimothyMothra deleted the tilee/livemetric_perfcounters_alt branch March 5, 2024 00:57
angiurgiu pushed a commit that referenced this pull request Mar 20, 2024
…y` (#42213)

* implement process counters

* use Demo for testing. REVERT THIS BEFORE MERGING

* fix typo

* changelog

* Revert "use Demo for testing. REVERT THIS BEFORE MERGING"

This reverts commit 38bf3f3.

* cleanup

* update changelog

* refactor

* typo

* rename

* changelog

* rename

* pr feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Monitor - Distro Monitor OpenTelemetry Distro Monitor - LiveMetrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants