[please help] Unmanaged memory only increases but does not decrease #6556

Closed · lfzm opened this issue May 27, 2020 · 38 comments

Assignee: ReubenBond
Labels: stale (Issues with no activity for the past 6 months)
Milestone: Triage

@lfzm (Contributor) commented May 27, 2020

Orleans version: v3.1.7
.NET Core: v3.1

[image]

I can provide a dotMemory snapshot.

@ReubenBond (Member)

Perhaps the GC is not releasing that memory back to Windows. What are your GC settings? Is server GC on?
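For anyone checking this on their own silo: when the server GC property is set, it is typically visible in the app's generated `*.runtimeconfig.json` under `System.GC.Server`. A minimal sketch of what that file looks like, assuming a .NET Core 3.1 app (the values are illustrative, not a recommendation for this workload):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.Concurrent": true
    }
  }
}
```

The `<ServerGarbageCollection>` and `<ConcurrentGarbageCollection>` MSBuild properties mentioned later in this thread map to these two entries.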

@lfzm (Contributor, Author) commented May 28, 2020

@ReubenBond Yes, server GC is turned on.

@HermesNew (Contributor)

Turning off server GC may solve the issue.

@lfzm (Contributor, Author) commented May 28, 2020

It can be resolved by turning off server GC. Does Orleans reserve memory space in advance for performance?

@ReubenBond (Member)

Orleans does not perform any special unmanaged memory allocations in order to reserve space. There are some buffer pools, but those are managed and grow dynamically.

@pipermatt (Contributor)

I may be having this same issue, though I am on Orleans v3.1.6. I had upgraded from Orleans v3.1.0 ...

[image]

I'll let you guess where the deploy and the rollback were... 😂

@sergeybykov (Contributor)

@pipermatt What's the deployment environment here - Windows/Linux, .NET Core/Full Framework, which version?

I don't see anything in the fixes between 3.1.0 and 3.1.6 that could obviously change memory allocation profile. Did you upgrade anything else at the same time by chance?

Have you tried taking and analyzing memory dumps to see what the memory is used by?

@pipermatt (Contributor)

Linux, .NET Core 3.1... On average memory utilization seems to increase at a rate of about 15MB/hr... and there's just SO much allocated it's difficult to wade through it all via command line in Linux. I'm not an expert in the profiling tools, that's for sure.

I seem to be able to reproduce the behavior locally on my MBP as well... but dotnet-dump doesn't support Mac OS X. 😏 So I've been ssh'ing into a test Linux instance to try to diagnose. Tomorrow I may grab a Windows machine so I have the full benefit of PerfView, dotTrace, etc... but first, since I can reproduce locally, I'm methodically stripping down our configuration to as barebones as possible one feature at a time.

We did upgrade several other libraries that are called by our grain code, but the memory leak is apparent on an idle silo that isn't taking any traffic and doesn't have any of our grains instantiated yet.

We'll get it figured out... and will report back. 👍

@pipermatt (Contributor) commented Jun 3, 2020

After stripping my silo of features until it was about as basic as possible, I came to the conclusion that what I was seeing locally was a red herring and not indicative of the problem I saw in production. On a whim, I rolled forward to the release that deployed just before the available memory tanked...

[image]

You can see a dip where the deploy happened for each silo node, but it is humming along just fine. So now, without a real reproduction case, I'm going to have to shelve this investigation unless it rears its head again.

@sergeybykov (Contributor)

Interesting. Did you also upgrade to 3.1.7?

@pipermatt (Contributor)

I have not yet, though I think I also spoke too soon... memory is going down again steadily... which matches the rate it did before (the first graph was zoomed out to a much larger time range)...
[image]

@sergeybykov (Contributor)

If memory does indeed leak over time, we need to look at memory dumps or GC profiles. @ReubenBond might have a suggestion for how to do that in a non-invasive manner.

@pipermatt (Contributor)

Yeah, I'm working that angle now, though not on the production servers (yet). I think I'm seeing the exact same behavior with this build in my TEST environment, so I'm working on memory dumps there.

@pipermatt (Contributor)

Update: there was another difference discovered. ;)

The version that appeared to be leaking memory had its LogLevel set to Debug... we had been running at LogLevel.Information previously. We weren't actually seeing a memory leak... we were seeing Linux allocate more and more memory to disk caching to buffer the writes to the system journal. This memory was always reclaimed when the Silo needed it, though this process itself was slow enough that we would see a spike of errors while it happened.

The tidbit that we didn't understand was why on redeploy, not ALL the memory that had been used was freed. Now, it makes perfect sense... because the silo process wasn't the one using it at all... Linux itself was. Eventually the OS decreased the cache allocation when we rolled back to the version that had LogLevel.Info and it no longer needed so much memory caching to keep up with the journal writes.

Mystery solved!

@sergeybykov (Contributor)

Thank you for the update, @pipermatt! Makes perfect sense. This reminds me again how often misconfigured logging may cause non-obvious issues.

@lfzm Have you resolved your problem? Can we close this issue now?

@HermesNew (Contributor)

@pipermatt Haha, the mystery has not been solved for me.
My LogLevel is Warning. Before turning off server GC, memory usage was between 1.1 GB and 1.5 GB. After turning it off, it sits between 320 MB and 350 MB.
At present I work around the high memory consumption by turning off server GC. This problem has been around for a long time.

@ReubenBond (Member)

@HermesNew I believe this is most likely a server GC (.NET Core) concern rather than something specific to Orleans. It might be worth looking at the various GC settings in the documentation here: https://docs.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector#systemgcretainvmcomplus_gcretainvm. In particular, RetainVM might be of interest.
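For context, a rough sketch of how RetainVM would be set via `runtimeconfig.json`, assuming .NET Core 3.x (the equivalent environment variable is `COMPlus_GCRetainVM`, per the linked page); this is illustrative rather than a recommendation for this workload:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.RetainVM": true
    }
  }
}
```

Per the docs, RetainVM controls whether memory segments that should be deleted are put on a standby list for future use instead of being released back to the operating system; it defaults to off.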

@lfzm (Contributor, Author) commented Jun 9, 2020

@ReubenBond If you start a simple Orleans silo and profile it with JetBrains dotMemory, you can see unmanaged memory growing. That is why I suspect a problem in Orleans.

@ReubenBond (Member)

Not necessarily. The GC deals with unmanaged memory; Orleans does not.

@HermesNew (Contributor)

@ReubenBond Maybe this GC setting is the best setting:

```xml
<ServerGarbageCollection>false</ServerGarbageCollection>
<ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>
```

@ReubenBond (Member)

I don't recommend that; keep server GC enabled if you are running in production. Are you running in a Linux container? You can set a limit on the maximum amount of memory used if you want. Note that server GC uses one heap per core by default, but you can reduce that with another setting.
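The two settings alluded to here are the GC heap hard limit and the GC heap count. A minimal `runtimeconfig.json` sketch, assuming .NET Core 3.x, with placeholder values (a 200 MiB hard limit and 4 heaps) rather than tuned recommendations:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.HeapCount": 4,
      "System.GC.HeapHardLimit": 209715200
    }
  }
}
```

Per the GC configuration docs, `runtimeconfig.json` takes decimal values, while the equivalent `COMPlus_GCHeapCount` / `COMPlus_GCHeapHardLimit` environment variables take hexadecimal values; the heap count setting only applies when server GC is enabled.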

@SebastianStehle (Contributor)

With .NET Core 3.0, the runtime should just respect the cgroup limits.

@ReubenBond (Member) commented Jun 9, 2020

Yep, by default it will allow up to 75% of the cgroup memory limit. CPU limits also play a part in determining the number of heaps. In this case, I think it's probably running on Windows, but I'm not sure.
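If that 75% default is too generous for a given container, the docs describe a percentage override; a minimal sketch, assuming .NET Core 3.x, with an illustrative 50% cap (the environment-variable form is `COMPlus_GCHeapHardLimitPercent`, which takes a hexadecimal value, so 50% would be `0x32`):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.HeapHardLimitPercent": 50
    }
  }
}
```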

@HermesNew (Contributor)

@ReubenBond It is in production, running on Windows Server 2012 R2.
It is working very well now, and memory usage is well controlled after turning off server GC.

BTW: Orleans version: v3.1.7, .NET Core: v3.1

@HermesNew (Contributor)

@ReubenBond I am now preparing to migrate to Linux containers, so I want to know the best settings. Based on current practice, this setting has been optimal for us.

@ReubenBond (Member)

Is that unmanaged memory causing the application to terminate? Does it grow forever, or just for a few hours? I would imagine that things hit a steady state rather quickly?

@HermesNew (Contributor)

The greater the load, the greater the memory consumption, and the memory does not decrease. Eventually it causes the application to terminate with an OOM exception.

@ReubenBond (Member)

Are you saying that you are seeing OOM exceptions?

@HermesNew (Contributor)

When the application terminates, it throws an OOM exception. I have analyzed the dump file; it is mainly unmanaged memory, and my program has no memory leak of its own.
So this is what puzzles me.

@HermesNew (Contributor)

> @ReubenBond If you start a simple Orleans silo and profile it with JetBrains dotMemory, you can see unmanaged memory growing. That is why I suspect a problem in Orleans.

Following @lfzm's steps, the problem can be reproduced.

@ReubenBond (Member)

Can you share the crash dump?

@Cloud33 commented Jun 11, 2020

I came across this:
https://blog.markvincze.com/troubleshooting-high-memory-usage-with-asp-net-core-on-kubernetes/

It suggests that because .NET misdetects the number of CPUs in Docker, server GC can consume a lot of memory, and it recommends turning off server GC in Docker and using workstation GC instead:

```xml
<PropertyGroup>
  <ServerGarbageCollection>false</ServerGarbageCollection>
</PropertyGroup>
```

Have you heard of this problem with CPU detection?
😉
dotnet/runtime#11933

@HermesNew (Contributor)

The dump file is large.
I have turned off server GC. At present, there is no problem with excessive memory usage.

@ReubenBond (Member)

@Cloud33 that advice no longer applies. The GC recognises CPU limits present in the container and adjusts heap count accordingly. Additionally, you can set the memory limit (and it's also detected from the container's cgroup).

@HermesNew You can set a memory limit if you want. If you do, do you still see OOM exceptions? How long does the application run for before crashing with an OOM?

@Cloud33 commented Jun 11, 2020

@ReubenBond Ok

@srollinet (Contributor) commented Oct 15, 2020

We are experiencing memory issues in production on two Linux servers running an Orleans cluster. For now, we don't know whether it is related to Orleans or not.

EDIT

Oops, my bad, it wasn't the processes I thought that were eating the memory... I should learn how to read ps results in Linux :P, sorry for the post...

@ReubenBond self-assigned this Oct 20, 2020
@ReubenBond added this to the Triage milestone Oct 20, 2020
@ghost added the stale label (Issues with no activity for the past 6 months) Dec 7, 2021
@ghost commented Dec 7, 2021

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

@ghost commented Mar 4, 2022

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.

@ghost closed this as completed Mar 4, 2022
@ghost locked as resolved and limited conversation to collaborators Apr 4, 2022