Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNAT PORT EXHAUSTION - possibly due to HttpClient #14616

Closed
MikeAlhayek opened this issue Oct 31, 2023 · 77 comments
Closed

SNAT PORT EXHAUSTION - possibly due to HttpClient #14616

MikeAlhayek opened this issue Oct 31, 2023 · 77 comments

Comments

@MikeAlhayek
Copy link
Member

MikeAlhayek commented Oct 31, 2023

I have 3 different Azure WebApps hosted in Azure. It may not matter, but I deploy my apps using docker containers. All of my app are connected to Azure SQL server instances (each app uses a different database), the same Shell blob storage, the same Media blob storage, the same Redis instance.

One site ran a report that pulls lots of content items, then does in aggregation to generate the final report. But this report, caused all 3 WebApps to got down at the same time! Yes all 3 different WebApps went down. not fun!

After lots of investigation, Microsoft team concluded that all of my web apps were a victim of SNAT PORT EXHAUTION. Here is more info about the issue and possible fixes.

What is SNAT?

SNAT means Source Network Address Translation. As you may already know, the Outbound/Inbound IP addresses correspond to the IP addresses of the public front end of your app's scale unit.

Whenever your application creates an outbound connection, SNAT will rewrite the source IP from the backend instance's IP address to the outbound public IP of the scale unit's frontend.

For example: If my app creates a connection to a SQL Database through public traffic, the SQL Server will see the frontend's outbound IP address as the client IP.

Each scale unit (or as some name it, stamp) has more than 65000 ports available for all of the instances that are in the same public stamp. Since this is a multi-tenant stamp, multiple users can be in the same stamp.

(Each user with their own instances allocated to themselves if they have a pricing tier of Basic or more).

Remember that you always have at least 128 ports preallocated to yourself, but this is not a limit and can be exceeded.

For each TCP connection to reuse an SNAT port, the 5 tuple has to be unique (TCP (or UDP), Source IP, Source Port, Destination IP, and Destination Port ).

Changing any of the destination information will of course break the connection and the source IP is the front end's outbound IP address, which we of course cannot change.

This means that we can only change the Destination port:

image

If the destination is different, we reuse the SNAT Port:

image

If all tuple is the same, a new port will be used:

For example, if an app opens one connection per second 1000 seconds to the same destination, it will use 1000 ports. On the other hand, if you open multiple connections to different destinations some of those connections will reuse the same port.

Based on your apps performance, you consumed over 474 SNAT Ports with 576 allocated on WebApp1

image

And you consumed over 474 SNAT Ports with 576 allocated on WebApp2

image

Both the same parameters

Also, if an app opens a connection to a database and database's server fails to send an ACK response back within 230 seconds, the connection will remain idle for 230 seconds causing other connections to remain in the queue which could possibly lead to SNAT port Exhaustion.

It is also important to know that whenever a client/server sends a FIN control flag, the TCP connection will remain in a TIME_WAIT state for 120 seconds and then free the SNAT Port. In other words, when the client or server acknowledges to close the connection, there is 120 seconds of 'cooldown' before freeing the SNAT Port

Note: Remember that multiple HTTP Requests can go on the same TCP connection.

What causes SNAT Port Exhaustion?:

  1. An application that opens a connection per HTTP request, to solve this you should open the TCP connection once and then reuse.
  2. Intense retry logic.
  3. An application with no connection pooling.
  4. Among others
    How can we prevent SNAT Port Exhaustion?
  5. Modify the application to reuse connections.
  6. Modify the application to use connection pooling, which consists on setting a fix
  7. ed set of connections (like 100 connections) which will be reused by the application.
  8. Modify the application to use less aggressive retry logic.
  9. Use sensible keepalives to reset the outbound idle timeout: https://msdn.microsoft.com/en-us/library/windows/desktop/ee470551(v=vs.85).aspx sending a packet through an idle connection every 30 seconds or 1 minute should be considered as it’ll reset any idle timers both in Azure and in on-premises firewalls.
  10. Integrate your app service in a Virtual Network and create outbound connections to private IPS instead of public IPs to avoid using the public stamp's SNAT Ports.

@jtkech I am using nightly previews in production. I am guessing this issue has to do with the recent improvements that you did around IHttpClientFactory.

@rjpowers10 @ShaneCourtrille @wAsnk have you tested out the preview in your production environment after last changed to the HttpClientFactory?

@sebastienros this is a weird one and you may have additional info.

@sebastienros
Copy link
Member

Interestingly, I believe the httpclient has a connections pool, so I found it weird we could create as many as we'd want with the handlers cache. So maybe we really should not release these, have them reused as a local field, or dispose them after their usage to have them free up the connections taken by the pool.

@sebastienros
Copy link
Member

Can't find the pool in the HttpClient per se, so must be in the handlers. Better reuse the httpclient instance then. It might not come from the RecaptchaService though, could be in another class where we inject HttpClient/HttpClientFactory incorrectly. Would be interesting to know what endpoints these are, or where the HttpClients are created (analyze trace/memory dump to find root instance)

@MikeAlhayek
Copy link
Member Author

@sebastienros I am not sure how to go about analyzing or obtaining memory dump from Azure WebApp. What is odd here is that the Post request to generate the report what cause this issue. The report query the database using ISession, then does it's logic in the memory. I don't think YesSQL uses HttpClient or the factory for anything. none of the code uses HttpClient "at least not for that reports". Count it be Redis? I am not using cache distribution directly for this report. Still not sure why this problem would occurs when I am using ISession to query data.

@jtkech
Copy link
Member

jtkech commented Oct 31, 2023

Was it working before? Do you have many tenants?

Does your report may trigger a tenant release? (Because the factory is now at the tenant level).

@MikeAlhayek
Copy link
Member Author

MikeAlhayek commented Oct 31, 2023

@jtkech I am not sure if it worked before. The report has been used for a while with no issue. but I am not sure if the report was generated before with a large amount of data.

We did not add new tenants (I think less than 10 tenants). We do not do a manual release of the tenant (unless it does it on it's own due to some sort of memory issues). The report fetches the data, created an ExcelPackage object (EPPlus library), and renders the ExcelPackage as HTML and allows the user to download the report using something like this

<a download="file.xlsx" href="data:application/octet-stream;base64,@Convert.ToBase64String(bytes)" class="btn btn-primary"><i class="fa-solid fa-download"></i> @T["Download To Excel"]</a>

@jtkech
Copy link
Member

jtkech commented Oct 31, 2023

possibly due to HttpClient

Or creating new db connections, to build your report are you doing many requests and/or in the code creating a new shell scope in a loop?

@sebastienros
Copy link
Member

In the Kudu tools you can create a full memory dump, then open it in Visual Studio and look for HttClient or SqlConnection instances.

@MikeAlhayek
Copy link
Member Author

@jtkech

Or creating new db connections, to build your report are you doing many requests and/or in the code creating a new shell scope in a loop?

No scopes here. Here is the report logic "which works for small date range (less data)"

  1. do permission check
  2. use DisplayManager.UpdateEditorAsync() to build the shapes for the filters. this also populates the filters using conditions parameter
    public class QueryParametersBaseViewModel : Entity
    {
        [BindNever]
        public List<Func<IQuery<ContentItem>, IQuery<ContentItem>>> Conditions { get; private set; } = new List<Func<IQuery<ContentItem>, IQuery<ContentItem>>>();
    }
  1. Then I execute the query like this
var query = _session.Query<ContentItem>();
var interviews = await query.All(viewModel.Conditions.ToArray()).ListAsync();
  1. I loop over the contentItems and extract the data that I want from parts/fields...
  2. Save the data into ExcelPackage
  3. Pass to the view which create the HTML based on the content of the ExcelPackage.

@jtkech
Copy link
Member

jtkech commented Nov 1, 2023

Didn't see any HttpClient that would open many connections, and normally we use the same Database connection in a given request unless new sessions/transactions are created explicitly. Hmm, there is also BlobClient to retrieve media info but normally not used if you only build editor shapes, there is also Redis connections but normally they are shared, also when using the oob RedisXmlRepository (for distributed data protection) for which we provide a connection factory.

Are you using the distributed tenants feature that runs in the background? Related to this

each app uses ... the same Redis instance.

But with different instance prefixes I assume

We also introduced media cleanup background tasks but they run once a day around midnight UTC.

@wAsnk
Copy link
Contributor

wAsnk commented Nov 1, 2023

@MikeAlhayek yeah we have it in prod since that preview release, but we didn't meet with this problem.

@rjpowers10
Copy link
Contributor

@MikeAlhayek no, I haven't tried the preview release

@MikeAlhayek
Copy link
Member Author

@jtkech

But with different instance prefixes I assume

Yes this is something does by OC by default.

Are you using the distributed tenants feature that runs in the background? Related to this

Yes I am. This feature is a host level feature only.

What is strange is that when I send the request to generate the report, all web apps stop responding.

I have a background task that runs every minute and process a queue (that is stored in the database). I don't think it would could an issue but just sharing this with you in case you see an issue with it. This background process places an explicate lock to ensure it only runs on one node. Here is the background service

[BackgroundTask(
    Title = "Queue Processor",
    Schedule = "* * * * *",
    Description = "Processes queue elements from the queue.",
    UsePipeline = true)]
public class QueueProcessorBackgroundTask : IBackgroundTask
{
    private static string _lockerKey = typeof(QueueProcessorBackgroundTask).FullName;

    public async Task DoWorkAsync(IServiceProvider serviceProvider, CancellationToken cancellationToken)
    {
        // To prevent multiple instances running in parallel on multiple containers, lets create a lock.
        var distributedLock = serviceProvider.GetRequiredService<IDistributedLock>();
        var shellSettings = serviceProvider.GetRequiredService<ShellSettings>();


        var (locker, locked) = await distributedLock.TryAcquireLockAsync(_lockerKey + shellSettings.Name, TimeSpan.FromMicroseconds(2000));

        if (!locked)
        {
            // Could not place a lock. This means that another instance is already running.
            return;
        }

        // It is important to use a using block here so that the lock is released at the end.
        await using var acquiredLock = locker;

        var queueStore = serviceProvider.GetRequiredService<IQueueStore>();
        var clock = serviceProvider.GetRequiredService<IClock>();
        var localClock = serviceProvider.GetRequiredService<ILocalClock>();

        var startedAtUtc = clock.UtcNow;

        // Get queue elements that are due.
        var items = await queueStore.GetDueAsync(startedAtUtc);

        if (!items.Any())
        {
            // Nothing to process.
            return;
        }

        var groups = items.GroupBy(x => x.QueueName)
            .Select(x => new
            {
                QueueType = x.Key,
                Context = new QueueElementProcessingContext()
                {
                    Elements = x.ToList(),
                },
            }).ToList();

        var processors = serviceProvider.GetServices<IQueueElementProcessor>();
        var logger = serviceProvider.GetRequiredService<ILogger<QueueProcessorBackgroundTask>>();

        foreach (var group in groups)
        {
            var processor = processors.FirstOrDefault(processor => processor.CanHandle(group.QueueType));

            if (processor == null)
            {
                logger.LogWarning("Unable to find a queue element processor that is able to handle the '{queueType}' QueueType.", group.QueueType);

                continue;
            }

            logger.LogTrace("QUEUE-ELEMENT: About to process queue-elements for '{queueType}' Queue-type.", group.QueueType);

            await processor.ProcessAsync(group.Context);

            foreach (var element in group.Context.Elements)
            {
                if (element.IsRemovable)
                {
                    logger.LogTrace("REMOVE! An item was flagged as removable after it was processed: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);

                    // Since this item was marked for removal, delete it.
                    await queueStore.DeleteAsync(element);

                    continue;
                }

                element.Counter++;

                if (!element.IsProcessed)
                {
                    // At this point, we know that the item was not processed.
                    // Since a failed attempt was made, let's first increment the failed counter.
                    logger.LogTrace("UNABLE-TO-PROCESS! Unable to process an item: element ID: '{elementId}', CorrelationId: '{correlationId}'. Increasing the Failed Attempt from '{failedCounter}'", element.Id, element.CorrelationId, element.FailedCounter);

                    element.FailedCounter++;

                    if (!await processor.IsProcessableAsync(element))
                    {
                        // Since we can no longer process this item, delete it.
                        logger.LogTrace("UNABLE-TO-PROCESS! An item has became unprocessable: element ID: '{elementId}', CorrelationId: '{correlationId}'. The Failed Attempt is set to '{failedCounter}'", element.Id, element.CorrelationId, element.FailedCounter);

                        await queueStore.DeleteAsync(element);
                    }
                    else
                    {
                        await queueStore.SaveAsync(element);
                    }

                    continue;
                }

                logger.LogTrace("SUCCESS! An item was successfully processed: element ID: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);

                // At this point we know that the item was successfully processed.
                if (!String.IsNullOrEmpty(element.Expression))
                {
                    logger.LogTrace("RECURRING: Recurring expression found for item. Will try to create a new instance.", element.Id, element.CorrelationId, element.FailedCounter);

                    var endAtUtc = element.EndAtUtc ?? DateTime.MaxValue;

                    if (endAtUtc > startedAtUtc)
                    {
                        logger.LogTrace("RECURRING: The element will need future processing. element ID: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);

                        // At this point, we know that this item has a cron expression and could have a another instance.
                        // Let's try to figure out the next occurrence.
                        var schedule = CrontabSchedule.TryParse(element.Expression);

                        if (schedule != null)
                        {
                            element.LastRunUtc = startedAtUtc;

                            element.NextRunUtc = schedule.GetNextOccurrence(startedAtUtc, endAtUtc).AddMinutes(element.Minutes);

                            // Reset the failed counter and the processed flag since we are starting a new attempt.
                            element.IsDeleted = false;
                            element.EndAtUtc = null;

                            element.IsProcessed = false;
                            element.FailedCounter = 0;

                            logger.LogTrace("RECURRING: The element will need future processing. element ID: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);

                            await queueStore.SaveAsync(element);

                            continue;
                        }
                    }
                }

                logger.LogTrace("SUCCESS! The element was processed and completed successfully. element ID: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);

                // If we got this far, the item was processed and has no more occurrences.
                // We can safely remove it from the queue.
                await queueStore.DeleteAsync(element);
            }
        }

        logger.LogTrace("FINISH! The task background finish running after '{seconds}' seconds.", clock.UtcNow.Subtract(startedAtUtc).TotalSeconds);

        await queueStore.SaveChangesAsync();
    }

@jtkech
Copy link
Member

jtkech commented Nov 1, 2023

It may depend on what the processors are doing but I assume they are scoped services using the current session/connection. That said if you update many items and if they are indexed, there is an handler that create indexing tasks, and in IndexingTaskManager we create explicitly a connection.

using var connection = dbConnectionAccessor.CreateConnection();

@jtkech
Copy link
Member

jtkech commented Nov 1, 2023

But one connection per 100 items

@jtkech
Copy link
Member

jtkech commented Nov 1, 2023

But with different instance prefixes I assume

Yes this is something does by OC by default.

But it should be configured explicitly.

@MikeAlhayek
Copy link
Member Author

@jtkech there is only 1 index that this process uses during update or select. so I don't think this is a problem.

But, we create a transaction and we commit the code synchronously. I don't know where in OC we finally call SaveChangesAsync() to see how we commit the transaction. But from the IndexiningTaskManager you shared, the commit call does not use asynchronous operation. Which means it would lock the current thread until it is done. right?

@jtkech
Copy link
Member

jtkech commented Nov 1, 2023

Yes, we could have used await transaction.CommitAsync()

@jtkech
Copy link
Member

jtkech commented Nov 1, 2023

Did you tried the Kudu tools as suggested by @sebastienros?

@MikeAlhayek
Copy link
Member Author

@jtkech I can't seems to figure that part out. I think its because the app is running on lunix, generating memory dump is very limited.

@jtkech
Copy link
Member

jtkech commented Nov 1, 2023

For info there is also the IndexingBackgroundTask that call ProcessContentItemsAsync() that creates a new shell scope per 100 items. Maybe worth to try after disabling the Indexing.

Did you configure your Redis instance prefixes?

@MikeAlhayek
Copy link
Member Author

@jtkech what settings should be configured? There are there are two settings related to Redis

Configuration: TheInstanceName.redis.cache.windows.net:6380,password=ThePassword=,ssl=True,abortConnect=False,allowAdmin=true
InstancePrefix: this prefix is used to specify server prefix like production or staging so that the same cache key is unique to the service instance. I think when we add key to the Redis cache, we prefix the key with the tenant name so that the key is always unique (ie, {InstancePrefix}_{TenantName}_{EntryKey})

am I missing anything here?

@jtkech
Copy link
Member

jtkech commented Nov 1, 2023

InstancePrefix allows to share the same Redis instance by multiple apps, so it should be different per app, hmm maybe unless you don't have duplicate tenant names across all apps, but still recomended.

@MikeAlhayek
Copy link
Member Author

@jtkech every webapp has different InstancePrefix in my setup. so even if I am using the same tenant name in multiple instances the InstancePrefix will ensure that the key is unique. Either way, I am already using InstancePrefix. Unless we have a service that does not automatically add the tenant name to the key in OC that I am not aware of.

@jtkech
Copy link
Member

jtkech commented Nov 2, 2023

every webapp has different InstancePrefix in my setup.

Okay, so you configured them, hmm is it only part of your connection string? We have a dedicated config field for this to diferentiate the keys but still using the same Redis instance if this is what you want.

@MikeAlhayek
Copy link
Member Author

@jtkech

For info there is also the IndexingBackgroundTask that call ProcessContentItemsAsync() that creates a new shell scope per 100 items. Maybe worth to try after disabling the Indexing.

In one of my webapps, I am using ElasticSearch for every tenant but I did not enable OrchardCore.Search.Elasticsearch.Worker feature. So IndexingBackgroundTask is not used.

Okay, so you configured them, hmm is it only part of your connection string? We have a dedicated config field for this to diferentiate the keys but still using the same Redis instance if this is what you want.

It would be nice to be able to specify a different database per app. Like for app A use database 1, for app B user database 2 so that we can use multiple databases instead of just index. I think Redis has a limit of 16 databases by default that are available. But still would be nice to specify the database.

I configured the InstancePrefix by setting the following parameter and not directly in the connection string.

public string InstancePrefix { get; set; }

@MikeAlhayek
Copy link
Member Author

maybe I can specify defaultDatabase=2 parameter int he connection string to use a separate database. I have to try it out.

https://stackexchange.github.io/StackExchange.Redis/Configuration.html

@jtkech
Copy link
Member

jtkech commented Nov 2, 2023

I configured the InstancePrefix

Okay cool then

About multiple databases, yes we dicussed about this but I don't remember the details ;)

@ShaneCourtrille
Copy link
Contributor

ShaneCourtrille commented Nov 2, 2023

FYI enabling private endpoints on Redis, KeyVault, Azure Blob will also resolve any SNAT issues you're having

Oh and this probably requires your App Service be using a a vnet? I forget as someone else set it up for us.

@MikeAlhayek
Copy link
Member Author

@ShaneCourtrille so are you suggesting that this is an Azure setup issue and not related to code?

@MikeAlhayek
Copy link
Member Author

@sebastienros @jtkech I collected more memory dumps. Microsoft support cam back with some interesting findings.

First finding,

Most threads are on waiting and but more importantly a high memory usage.

The thread pool at the time the memory dump was collected was at 14% CPU

> threadpool
Using the Portable thread pool.
CPU utilization:  14%
Workers Total:    7
Workers Running:  7
Workers Idle:     0
Worker Min Limit: 1
Worker Max Limit: 32767

In the finalizer queue there were 3995 objects of System.Reflection.Emit.DynamicResolver

7e563a9081e0    39       936 System.WeakReference<System.Diagnostics.Tracing.EventProvider>
7e563a908500    39       936 System.WeakReference<System.Diagnostics.Tracing.EventSource>
7e563fec9990    21     1,008 OrchardCore.Environment.Shell.ShellSettings
7e5642daa518     5     1,040 Microsoft.Data.SqlClient.SqlConnection
7e5643501f28    11     1,232 YesSql.Session
7e563f8b0b88    20     1,280 System.Net.Sockets.SafeSocketHandle
7e56431befe8    36     1,440 Microsoft.Win32.SafeHandles.SafeAllocHHandle
7e5640890018    18     1,584 OrchardCore.Environment.Shell.Builders.ShellContext+PlaceHolder
7e563fce3690     7     1,792 System.Net.Security.SslStream
7e564075bf88     8     1,984 Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketReceiver
7e563b8d6058    20     2,400 System.Net.Sockets.Socket
7e563a8d0d00    35     2,520 System.Threading.Thread
7e563a8d0ed0   189     4,536 System.WeakReference
7e563a907be0    78     4,992 System.Diagnostics.Tracing.EventSource+OverrideEventProvider
7e563f8e0568    22     6,512 System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs
7e563a98c3e0 3,995   287,640 System.Reflection.Emit.DynamicResolver
Total 4,912 objects, 346,664 bytes

Now analyzing what is on the Heap

> gcheapstat
Heap     Gen0       Gen1       Gen2       LOH        POH        FRZ
Heap0    26214360   25822784   2278450688 3407032    683072
Total    26214360   25822784   2278450688 3407032    683072
Free space:
Heap     Gen0       Gen1       Gen2       LOH        POH        FRZ
Heap0    78080      14679792   35294280   1391832    0          SOH:2%     LOH:40%  POH:0%
Total    78080      14679792   35294280   1391832    0
Committed space:
Heap     Gen0       Gen1       Gen2       LOH        POH        
Heap0    26218496   26218496   2281701376 3538944    684032
Total    26218496   26218496   2281701376 3538944    684032

The Gen 2 is using 2.28170138 GB of private bytes. Private Bytes is the current size, in bytes, of memory that this process has allocated that cannot be shared with other processes.

The current Azure plan is set t0 3.5GB, 1 Core, P1v2 plan.

Thread 34 seems to be using GC Mode cooperative which when it runs GC it will block all threads. If a thread's GC mode is set to Preemptive, this means that the GC can suspend this thread at any time. In contrast, the Cooperative mode means that the GC has to wait for the thread to switch to the Preemptive mode before you suspend it.

> clrthreads
ThreadCount:      38
UnstartedThread:  0
BackgroundThread: 29
PendingThread:    0
DeadThread:       8
Hosted Runtime:   no
                                                                                                            Lock
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1        1 000063907EA63970  2020020 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
   5    2        d 000063907EA8EF20    21220 Preemptive  00007E16B1CE7078:00007E16B1CE9050 000063907EA6F5F0 -00001 Ukn (Finalizer)
   6    4       1a 000063907EE66820    21220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
   7    5       1c 000063907EE1B040    21220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
   8    6       1d 000063907EE8B250    21220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
   9    7       1e 000063907EEE27B0  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  10    8       1f 000063907EEEC680    21220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
XXXX    9        0 000063907EE812D0  1031820 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
  11   10       23 000063907EEF1230  1021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
  12   12       27 000063907EF2B0E0  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  13   13       28 000063907EF406F0    21220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  15   15       30 000063907EF69AF0  2021220 Preemptive  00007E16B2800C58:00007E16B2800FD0 000063907EA6F5F0 -00001 Ukn
  16   21       3b 00007E15802AFB70  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  17   22       3c 00007E15802B0D90  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  18   23       3d 00007E15802B2A70    21220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  19   24       3e 00007E15802B4930  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  20   25       3f 00007E15802B68B0  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  21   26       40 00007E15802BAC20  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  22   27       41 00007E15802BCBE0  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  23   28       42 00007E15802BEBA0  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  24   29       43 00007E15802C0B60  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  25   30       44 00007E15802C2B20  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  26   17       51 00007E563416E840  2021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
XXXX   11        0 00007E562803E2A0  1031820 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
XXXX   33        0 00007E56300490D0  1031820 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
XXXX   34        0 00007E562404E660  1031820 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
  27   31      30a 00007E563004F600  1021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
XXXX   19        0 00007E158838AD50  1031820 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
  28   14      30c 000063907F1897F0  3021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
XXXX   20        0 00007E56300435D0  1031820 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
XXXX   18        0 00007E5624055810    31820 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  29   16      327 00007E158021EA00    21220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  30   32      32a 00007E158003C700  1021220 Preemptive  00007E16B3500300:00007E16B3500358 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
  31    3      32c 00007E563417B6D0  1021220 Preemptive  00007E16B34CD080:00007E16B34CD090 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
  32   35      32d 000063907EFE1F10  1021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
  33   36      32e 00007E158407A230  1021220 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn (Threadpool Worker)
XXXX   37        0 00007E15840400E0    31820 Preemptive  0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn
  34   38      330 00007E5634644600    21220 Cooperative 0000000000000000:0000000000000000 000063907EA6F5F0 -00001 Ukn 

In summary, the following are what is causing this issue

1- The web app is suffering High Memory due to private bytes.
2- There are a lot of thread management operations by PortableThreadPool+WorkerThread.WorkerThreadStart() this needs to be corrected as well.

Question, what could be creating the Cooperative thread that is blocking? Is it YesSql or OC?

What is creating and managing PortableThreadPool+WorkerThread.WorkerThreadStart()

@jtkech
Copy link
Member

jtkech commented Dec 13, 2023

Not so obvious to read but I can see a lot of WeakReference, we use them to make a shell dependent on another one, as I remember only when a tenant OpenId scope relies on a tenant OpenId application.

And when an ExpiredHandlerTrackingEntry (see below) is created to track an HttpClient handler.

Hmm are you registering an HttpClient handler that would be always referenced or being itself a singleton, this in place of being for example transient?

internal sealed class ExpiredHandlerTrackingEntry
{
    private readonly WeakReference _livenessTracker;

    // IMPORTANT: don't cache a reference to `other` or `other.Handler` here.
    // We need to allow it to be collected by the GC.
    public ExpiredHandlerTrackingEntry(ActiveHandlerTrackingEntry other)
    {
        Name = other.Name;
        Scope = other.Scope;

        _livenessTracker = new WeakReference(other.Handler);
        InnerHandler = other.Handler.InnerHandler!;
    }

@sebastienros
Copy link
Member

What is creating and managing PortableThreadPool+WorkerThread.WorkerThreadStart()

Kestrel. This is standard thread-pool work, I don't see anything to worry about.

Finalizers

Session invokes GC.SuppressFinalize() when Dispose() is called so there should not be any pending finalizers for Session if we call it correctly: https://github.com/sebastienros/yessql/blob/main/src/YesSql.Core/Session.cs#L904

Same comment for ShellSettings and PlaceHolder. To see when it goes into the finalizer we could try to add some Console logging locally and see if it hits this path sometimes.

I see DynamicResolver doesn't have anything suppressing the finalizer, so it might be by design: https://source.dot.net/#System.Private.CoreLib/src/System/Reflection/Emit/DynamicILGenerator.cs,622

But it's possible that we don't use it as intended. @jtkech any idea what could use it? I could think about DI (as it might generate lambdas dynamically to resolve services in constructors).

Private bytes

Probably your code that allocates too much, and the GC is overwhelmed. Do batches, reuse memory and data structures as much as possible.

@jtkech
Copy link
Member

jtkech commented Dec 14, 2023

@jtkech any idea what could use it?

I will think about it, maybe when using reflection around the dynamic ContentItem.Content, or while deserializing json elements from ContentItem.Data to ContentElements.

@jtkech
Copy link
Member

jtkech commented Dec 14, 2023

Oh yes, I missed in the list DynamicResolver that is much higher than WeakReference ;)

@jtkech
Copy link
Member

jtkech commented Dec 14, 2023

Or to resolve query delegates not known at compile time.

#14616 (comment)

@jtkech
Copy link
Member

jtkech commented Dec 19, 2023

Looks like we have a similar issue for a customer with a lot of rooted System.Reflection.Emit.DynamicResolver since we upgrade from OC 1.4 to OC 1.5, I will investigate.

DynamicResolver

@sebastienros
Copy link
Member

What makes you say it's between 1.4 and 1.5?

@jtkech
Copy link
Member

jtkech commented Dec 19, 2023

It happens since we upgraded the customer application from 1.4 to 1.5

But not fully sure it is related to the upgrading

I will investigate

@sebastienros
Copy link
Member

Things I noticed are using IL compilation:

  • [ShapeAttribute] in the shape table provider
  • DateParser in AuditTrail (via Parlot)
  • Anything that uses Parlot compilation (I checked and we don't compile LiquidViewParser)

What I saw in DynamicResolver source is that there is a cyclic dependency between it and DynamicMethod, which is then passed to the delegate resulting from Delegate CreateDelegate(Type delegateType, object? target).

https://github.com/dotnet/runtime/blob/main/src/coreclr/System.Private.CoreLib/src/System/Reflection/Emit/DynamicMethod.CoreCLR.cs#L36

https://github.com/dotnet/runtime/blob/main/src/coreclr/System.Private.CoreLib/src/System/Reflection/Emit/DynamicMethod.CoreCLR.cs#L60

@jtkech
Copy link
Member

jtkech commented Dec 19, 2023

Thanks for the info, we don't use AuditTrail, I will look at it asap.

@sebastienros
Copy link
Member

I can repro with a simple blog recipe by hitting the about page. Took a memory dump, analyzed in perfview. What version of dotnet did we migrate to in 1.5? Could be some changes in DI.

@jtkech
Copy link
Member

jtkech commented Dec 19, 2023

1.5 uses net6.0 same as with 1.4

@jtkech
Copy link
Member

jtkech commented Dec 20, 2023

This lets me think about

_activatorByFactory = Delegate.CreateDelegate(typeof(Func<ITagHelperFactory, ViewContext, ITagHelper>),
factoryMethod) as Func<ITagHelperFactory, ViewContext, ITagHelper>;

private static Action<object, object> MakeFastPropertySetter(Type type, PropertyInfo prop)
{
// Create a delegate TDeclaringType -> { TDeclaringType.Property = TValue; }
var setterAsAction = prop.SetMethod.CreateDelegate(typeof(Action<,>).MakeGenericType(type, prop.PropertyType));
var setterClosedGenericMethod = _callPropertySetterOpenGenericMethod.MakeGenericMethod(type, prop.PropertyType);
var setterDelegate = setterClosedGenericMethod.CreateDelegate(typeof(Action<object, object>), setterAsAction);
return (Action<object, object>)setterDelegate;
}

@sebastienros
Copy link
Member

Not sure this is generating IL. This is CreateDelegate from a MethodInfo, not a DynamicMethodInfo.
Also I removed most of these tag helpers to use dedicated Fluid implementations. There could be a few left, but not much.

@jtkech
Copy link
Member

jtkech commented Dec 20, 2023

Yes, was just a thinking before having time to look at it more in depth

@sebastienros
Copy link
Member

I double checked the finalizers for Session, and they are all disposed correctly. When using dotnet gcdump which is doing a collection before getting a memory dump they don't appear. And the destructors are not invoked, only the Dispose/Async, at the end of the request (scope). So I think the dump from the task manager are giving a wrong idea of the finalizer queue, the code won't be invoked in the end.

For DynamicResolver I still see them but mostly due to DI.

@jtkech
Copy link
Member

jtkech commented Dec 20, 2023

Okay thanks for the info

Yes when I do a dump from perfView I can force the GC but this one was not done by me.

@MikeAlhayek
Copy link
Member Author

@jtkech I missed you're comment earlier about handlers. Here is how I register my http clients

services.AddScoped<SendbirdApplicationService>()
            .AddHttpClient<SendbirdApplicationService>((serviceProvider, client) =>
        {
            var options = serviceProvider.GetRequiredService<IOptions<SendbirdOptions>>().Value;

            client.BaseAddress = new Uri("https://gate.sendbird.com/api/v2/", UriKind.Absolute);
            client.DefaultRequestHeaders.TryAddWithoutValidation("SENDBIRDORGANIZATIONAPITOKEN", options.OrganizationApiKey);
        }).AddTransientHttpErrorPolicy(policy => policy.WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(0.5 * attempt)));

What's strange that I am noticing is that the AddHttpClient call back is called multiple times on every request. Even when I am not requesting or using SendbirdApplicationService. I am think that call back should be called, unless the SendbirdApplicationService service is requested.

@jtkech
Copy link
Member

jtkech commented Dec 20, 2023

Because SendbirdApplicationService is a typed HttpClient it happens as soon as it is injected/resolved, it would mean for example that a scoped service is resolved/used and that this scope service injects SendbirdApplicationService even if it doesn't use it.

@MikeAlhayek
Copy link
Member Author

Yes but it is called even when it is not injected. It seems to be called on every request even when no service is using it/injecting it.

I'll check again tomorrow and provide more details after trying it again.

@jtkech
Copy link
Member

jtkech commented Dec 20, 2023

Okay, let me know, also check if not resolved by GetService<>() for example

@MikeAlhayek
Copy link
Member Author

MikeAlhayek commented Dec 20, 2023

@jtkech that service services was injected in different handlers like IUserEventHandler and IFeatureEventHandler. So clicking on "Features" to list the features in the app would require creating HttpClient.

I changed it so these services are only resolved on demand directly from IServiceProvider. I usually don't like resolving services directly from IServiceProvider, but in this case, I had to. The other option is to inject IHttpClientFactory in these services and manually create the client when needed using IHttpClientFactory directly.

Update

I changed the code so it resolves IHttpClientFactory instead of HttpClient this is better so I don't have to resolve my other services lazily all over the place. Anyhow, creating a new HttpClient in itself should not be a problem. I don't think creating HttpClient auto uses a new post. When using action methods like Get, Post, etc, then an http connection is used.

@jtkech
Copy link
Member

jtkech commented Dec 20, 2023

I usually don't like resolving services directly from IServiceProvider

Yes, using a typed client is like using the client factory but all is auto done when the tied service is resolved, useful for example for a controller whose all action(s) use a given HttpClient, in that case you just need to inject the service (tied to the typed client) that will be only resolved when the controller gets activated.

But this may not be the case when resolved from an handler, filter, in that case better to use the http client factory or lazily resolve the tied service which at the end will do the same things.

creating a new HttpClient in itself should not be a problem.

Yes, the problem is not the HttpClient in itself but CreateClient() also add ActiveHandlerTrackingEntrys that start timers to manage the lifetime of handlers.

@Piedone
Copy link
Member

Piedone commented Apr 15, 2024

Any news here since then?

@Piedone Piedone added this to the 2.x milestone May 21, 2024
@MikeAlhayek MikeAlhayek modified the milestones: 2.x, 2.0 Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants