[please help] Unmanaged memory only increases but does not decrease #6556
Comments
Perhaps the GC is not releasing that memory back to Windows. What are your GC settings? Is ServerGC on?
@ReubenBond Yes, ServerGC is turned on.
Turning off ServerGC may solve the issue.
It can be solved by turning off ServerGC. Does Orleans reserve memory space in advance for performance?
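For anyone comparing configurations: a minimal sketch of how server GC can be toggled for the silo host by adding a runtimeconfig.template.json next to the project file. The setting names are the documented .NET GC configuration keys; whether workstation GC is the right trade-off for your load is an assumption to test, not a recommendation.

```json
{
  "configProperties": {
    "System.GC.Server": false,
    "System.GC.Concurrent": true
  }
}
```

The MSBuild equivalent is `<ServerGarbageCollection>false</ServerGarbageCollection>` in the .csproj; both end up in the generated runtimeconfig.json.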
Orleans does not perform any special unmanaged memory allocations in order to reserve space. There are some buffer pools, but those are managed and grow dynamically.
@pipermatt What's the deployment environment here - Windows/Linux, .NET Core/Full Framework, which version? I don't see anything in the fixes between 3.1.0 and 3.1.6 that could obviously change the memory allocation profile. Did you upgrade anything else at the same time by chance? Have you tried taking and analyzing memory dumps to see what the memory is used by?
Linux, .NET Core 3.1... On average memory utilization seems to increase at a rate of about 15MB/hr... and there's just SO much allocated it's difficult to wade through it all via the command line in Linux. I'm not an expert in the profiling tools, that's for sure. I seem to be able to reproduce the behavior locally on my MBP as well... but dotnet-dump doesn't support Mac OS X. 😏 So I've been ssh'ing into a test Linux instance to try to diagnose. Tomorrow I may grab a Windows machine so I have the full benefit of PerfView, dotTrace, etc... but first, since I can reproduce locally, I'm methodically stripping down our configuration to as barebones as possible, one feature at a time. We did upgrade several other libraries that are called by our grain code, but the memory leak is apparent on an idle silo that isn't taking any traffic and doesn't have any of our grains instantiated yet. We'll get it figured out... and will report back. 👍
After stripping my silo of features until it was about as basic as possible, I came to the conclusion that what I was seeing locally was a red herring and not indicative of the problem I saw in production. On a whim, I rolled forward to the release that deployed just before the available memory tanked... You can see a dip where the deploy happened for each silo node, but it is humming along just fine. So now, without a real reproduction case, I'm going to have to shelve this investigation unless it rears its head again.
Interesting. Did you also upgrade to 3.1.7?
If memory does indeed leak over time, we need to look at memory dumps or GC profiles. @ReubenBond might have a suggestion for how to do that in a non-invasive manner.
Yeah, I'm working that angle now, though not on the production servers (yet). I think I'm seeing the exact same behavior with this build in my TEST environment, so I'm working on memory dumps there.
Update: there was another difference discovered. ;) The version that appeared to be leaking memory had its LogLevel set to Debug... we had been running at LogLevel.Information previously. We weren't actually seeing a memory leak... we were seeing Linux allocate more and more memory to disk caching to buffer the writes to the system journal. This memory was always reclaimed when the Silo needed it, though this process itself was slow enough that we would see a spike of errors while it happened. The tidbit that we didn't understand was why on redeploy, not ALL the memory that had been used was freed. Now, it makes perfect sense... because the silo process wasn't the one using it at all... Linux itself was. Eventually the OS decreased the cache allocation when we rolled back to the version that had LogLevel.Info and it no longer needed so much memory caching to keep up with the journal writes. Mystery solved!
Thank you for the update, @pipermatt! Makes perfect sense. This reminds me again how often misconfigured logging may cause non-obvious issues. @lfzm Have you resolved your problem? Can we close this issue now?
@pipermatt Haha, the mystery has not been solved.
@HermesNew I believe this is most likely a ServerGC (.NET Core) concern, rather than something specific to Orleans. It might be worth looking at the various GC settings in the documentation here: https://docs.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector#systemgcretainvmcomplus_gcretainvm, in particular, RetainVM might be of interest.
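For reference, a sketch of what the RetainVM knob looks like in a runtimeconfig.template.json (the key name comes from the linked docs; the default is already false, so this mostly matters if it was explicitly enabled somewhere):

```json
{
  "configProperties": {
    "System.GC.Server": true,
    "System.GC.RetainVM": false
  }
}
```

With RetainVM enabled, the GC keeps segments it would otherwise delete on a standby list for reuse instead of releasing them to the OS, which can make the process footprint look like it never shrinks.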
@ReubenBond Starting a simple Orleans silo and profiling it with JetBrains dotMemory shows unmanaged memory, so I suspect an Orleans problem.
Not necessarily. The GC deals with unmanaged memory; Orleans does not.
@ReubenBond
I don't recommend it. I recommend keeping server GC enabled if you are running in production. Are you running in a Linux container? You can set a limit on the maximum amount of memory used if you want. Note that ServerGC uses one heap per core by default, but you can reduce that using another setting.
With .NET Core 3.0, the runtime should just respect the cgroup limits. |
Yep, by default it will allow up to 75% of the cgroup memory limit. CPU limits also play a part in determining the number of heaps. In this case, I think it's probably running on Windows, but I'm not sure.
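A sketch of the heap-count setting mentioned above, again via runtimeconfig.template.json (the value 4 is purely illustrative; the setting only has an effect when server GC is enabled):

```json
{
  "configProperties": {
    "System.GC.Server": true,
    "System.GC.HeapCount": 4
  }
}
```

Without it, server GC creates roughly one heap per visible core (or per the container's CPU limit), which is one reason server GC has a higher memory baseline than workstation GC.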
@ReubenBond It is in production, running on Windows Server 2012 R2. BTW: Orleans version: v3.1.7, .NET Core: v3.1.
@ReubenBond I am now preparing to migrate to a Linux container, so I want to know the best settings. Based on current practice, which settings are optimal?
Is that unmanaged memory causing the application to terminate? Does it grow forever, or just for a few hours? I would imagine that things hit a steady state rather quickly?
The greater the load, the greater the memory consumption, and the memory does not decrease. It eventually causes the application to terminate with an OOM exception.
Are you saying that you are seeing OOM exceptions?
When the application terminates, it throws an OOM exception. I have analyzed the dump file, mainly
According to @lfzm's method, this problem can be reproduced.
Can you share the crash dump?
From what I can see here, it seems that it's because
Have you heard that there seems to be a problem where the CPU count is detected incorrectly?
The dump file is large.
@Cloud33 that advice no longer applies. The GC recognises CPU limits present in the container and adjusts heap count accordingly. Additionally, you can set the memory limit (and it's also detected from the container's cgroup). @HermesNew You can set a memory limit if you want. If you do, do you still see OOM exceptions? How long does the application run for before crashing with an OOM?
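If an explicit cap is preferred over the default 75% of the cgroup limit, a hedged sketch using the percentage-based hard limit (the 50 is illustrative, not a recommendation):

```json
{
  "configProperties": {
    "System.GC.Server": true,
    "System.GC.HeapHardLimitPercent": 50
  }
}
```

There is also an absolute-bytes variant of this limit; see the GC configuration docs linked earlier for the exact key and value format.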
@ReubenBond Ok |
EDIT: Oops, my bad, it wasn't the processes I thought that were eating the memory... I should learn how to read.
We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.
This issue has been marked stale for the past 30 days and is being closed due to lack of activity.
Orleans version: v3.1.7
.NET Core: v3.1
Can provide a DotMemory snapshot.