SNAT PORT EXHAUSTION - possibly due to HttpClient #14616
Interestingly, I believe the HttpClient has a connection pool, so I found it weird that we could create as many as we want with the handlers cache. So maybe we really should not release these, reuse them as a local field, or dispose them after use so they free up the connections taken by the pool. |
Can't find the pool in the HttpClient per se, so it must be in the handlers. Better to reuse the HttpClient instance then. It might not come from the RecaptchaService though; it could be another class where we inject HttpClient/HttpClientFactory incorrectly. It would be interesting to know what endpoints these are, or where the HttpClients are created (analyze a trace/memory dump to find the root instance). |
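For context, a minimal sketch of the pattern being discussed, assuming a generic ASP.NET Core host (the client name and consumer class are illustrative, not the actual Orchard Core registration): letting IHttpClientFactory own the handlers means the HttpClient wrappers stay cheap while the pooled handlers, with their connections, get reused and recycled instead of piling up.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
public static class HttpClientRegistrationSketch
{
    // Hypothetical registration: the factory pools and recycles the message handlers.
    public static void Register(IServiceCollection services)
    {
        services.AddHttpClient("recaptcha", client =>
        {
            client.BaseAddress = new Uri("https://www.google.com/recaptcha/api/");
        });
    }
}
public class RecaptchaCaller
{
    private readonly IHttpClientFactory _httpClientFactory;
    public RecaptchaCaller(IHttpClientFactory httpClientFactory)
        => _httpClientFactory = httpClientFactory;
    public Task<HttpResponseMessage> VerifyAsync(string token)
    {
        // The HttpClient instance is a cheap wrapper; the pooled handler underneath
        // (and its connections) is what gets reused instead of leaking sockets.
        var client = _httpClientFactory.CreateClient("recaptcha");
        return client.PostAsync("siteverify", new StringContent(token));
    }
}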
@sebastienros I am not sure how to go about analyzing or obtaining a memory dump from an Azure WebApp. What is odd here is that the Post request to generate the report is what causes this issue. The report queries the database using ISession, then does its logic in memory. I don't think YesSql uses HttpClient or the factory for anything, and none of the code uses HttpClient, at least not for that report. Could it be Redis? I am not using distributed caching directly for this report. Still not sure why this problem would occur when I am using ISession to query data. |
Was it working before? Do you have many tenants? Could your report trigger a tenant release? (Because the factory is now at the tenant level.) |
@jtkech I am not sure if it worked before. The report has been used for a while with no issue, but I am not sure if the report was generated before with a large amount of data. We did not add new tenants (I think less than 10 tenants). We do not do a manual release of the tenant (unless it does it on its own due to some sort of memory issue). The report fetches the data, created an
|
Or creating new db connections: to build your report, are you doing many requests and/or creating a new shell scope in a loop in the code? |
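For reference, a rough sketch of the anti-pattern being asked about, assuming Orchard Core's IShellHost.GetScopeAsync / ShellScope.UsingAsync APIs (the report methods and item ids are hypothetical):
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using OrchardCore.Environment.Shell;
using YesSql;
public class ReportScopeSketch
{
    // Anti-pattern: a new shell scope (new scoped services, session and connection)
    // is created for every item of the report.
    public static async Task PerItemScopesAsync(IShellHost shellHost, ShellSettings settings, IEnumerable<long> itemIds)
    {
        foreach (var itemId in itemIds)
        {
            var scope = await shellHost.GetScopeAsync(settings);
            await scope.UsingAsync(async s =>
            {
                var session = s.ServiceProvider.GetRequiredService<ISession>();
                // ... load and aggregate a single item per scope/connection ...
            });
        }
    }
    // Preferred: one scope (or simply the ambient request scope) for the whole batch,
    // so the same session/connection is reused.
    public static async Task SingleScopeAsync(IShellHost shellHost, ShellSettings settings, IEnumerable<long> itemIds)
    {
        var scope = await shellHost.GetScopeAsync(settings);
        await scope.UsingAsync(async s =>
        {
            var session = s.ServiceProvider.GetRequiredService<ISession>();
            // ... load and aggregate all items with the same session/connection ...
        });
    }
}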
In the Kudu tools you can create a full memory dump, then open it in Visual Studio and look for HttpClient or SqlConnection instances. |
No scopes here. Here is the report logic, which works for a small date range (less data):
|
Didn't see any. Are you using the distributed tenants feature that runs in the background? Related to this
But with different instance prefixes, I assume. We also introduced media cleanup background tasks, but they run once a day around midnight UTC. |
@MikeAlhayek yeah, we have had it in prod since that preview release, but we haven't run into this problem. |
@MikeAlhayek no, I haven't tried the preview release |
Yes, this is something done by OC by default.
Yes I am. This feature is a host-level feature only. What is strange is that when I send the request to generate the report, all web apps stop responding. I have a background task that runs every minute and processes a queue (that is stored in the database). I don't think it would cause an issue, but I am sharing it with you in case you see a problem with it. This background process places an explicit lock to ensure it only runs on one node. Here is the background service: [BackgroundTask(
Title = "Queue Processor",
Schedule = "* * * * *",
Description = "Processes queue elements from the queue.",
UsePipeline = true)]
public class QueueProcessorBackgroundTask : IBackgroundTask
{
private static string _lockerKey = typeof(QueueProcessorBackgroundTask).FullName;
public async Task DoWorkAsync(IServiceProvider serviceProvider, CancellationToken cancellationToken)
{
// To prevent multiple instances running in parallel on multiple containers, lets create a lock.
var distributedLock = serviceProvider.GetRequiredService<IDistributedLock>();
var shellSettings = serviceProvider.GetRequiredService<ShellSettings>();
var (locker, locked) = await distributedLock.TryAcquireLockAsync(_lockerKey + shellSettings.Name, TimeSpan.FromMicroseconds(2000));
if (!locked)
{
// Could not place a lock. This means that another instance is already running.
return;
}
// It is important to use a using block here so that the lock is released at the end.
await using var acquiredLock = locker;
var queueStore = serviceProvider.GetRequiredService<IQueueStore>();
var clock = serviceProvider.GetRequiredService<IClock>();
var localClock = serviceProvider.GetRequiredService<ILocalClock>();
var startedAtUtc = clock.UtcNow;
// Get queue elements that are due.
var items = await queueStore.GetDueAsync(startedAtUtc);
if (!items.Any())
{
// Nothing to process.
return;
}
var groups = items.GroupBy(x => x.QueueName)
.Select(x => new
{
QueueType = x.Key,
Context = new QueueElementProcessingContext()
{
Elements = x.ToList(),
},
}).ToList();
var processors = serviceProvider.GetServices<IQueueElementProcessor>();
var logger = serviceProvider.GetRequiredService<ILogger<QueueProcessorBackgroundTask>>();
foreach (var group in groups)
{
var processor = processors.FirstOrDefault(processor => processor.CanHandle(group.QueueType));
if (processor == null)
{
logger.LogWarning("Unable to find a queue element processor that is able to handle the '{queueType}' QueueType.", group.QueueType);
continue;
}
logger.LogTrace("QUEUE-ELEMENT: About to process queue-elements for '{queueType}' Queue-type.", group.QueueType);
await processor.ProcessAsync(group.Context);
foreach (var element in group.Context.Elements)
{
if (element.IsRemovable)
{
logger.LogTrace("REMOVE! An item was flagged as removable after it was processed: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);
// Since this item was marked for removal, delete it.
await queueStore.DeleteAsync(element);
continue;
}
element.Counter++;
if (!element.IsProcessed)
{
// At this point, we know that the item was not processed.
// Since a failed attempt was made, let's first increment the failed counter.
logger.LogTrace("UNABLE-TO-PROCESS! Unable to process an item: element ID: '{elementId}', CorrelationId: '{correlationId}'. Increasing the Failed Attempt from '{failedCounter}'", element.Id, element.CorrelationId, element.FailedCounter);
element.FailedCounter++;
if (!await processor.IsProcessableAsync(element))
{
// Since we can no longer process this item, delete it.
logger.LogTrace("UNABLE-TO-PROCESS! An item has became unprocessable: element ID: '{elementId}', CorrelationId: '{correlationId}'. The Failed Attempt is set to '{failedCounter}'", element.Id, element.CorrelationId, element.FailedCounter);
await queueStore.DeleteAsync(element);
}
else
{
await queueStore.SaveAsync(element);
}
continue;
}
logger.LogTrace("SUCCESS! An item was successfully processed: element ID: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);
// At this point we know that the item was successfully processed.
if (!String.IsNullOrEmpty(element.Expression))
{
logger.LogTrace("RECURRING: Recurring expression found for item. Will try to create a new instance.", element.Id, element.CorrelationId, element.FailedCounter);
var endAtUtc = element.EndAtUtc ?? DateTime.MaxValue;
if (endAtUtc > startedAtUtc)
{
logger.LogTrace("RECURRING: The element will need future processing. element ID: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);
// At this point, we know that this item has a cron expression and could have a another instance.
// Let's try to figure out the next occurrence.
var schedule = CrontabSchedule.TryParse(element.Expression);
if (schedule != null)
{
element.LastRunUtc = startedAtUtc;
element.NextRunUtc = schedule.GetNextOccurrence(startedAtUtc, endAtUtc).AddMinutes(element.Minutes);
// Reset the failed counter and the processed flag since we are starting a new attempt.
element.IsDeleted = false;
element.EndAtUtc = null;
element.IsProcessed = false;
element.FailedCounter = 0;
logger.LogTrace("RECURRING: The element will need future processing. element ID: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);
await queueStore.SaveAsync(element);
continue;
}
}
}
logger.LogTrace("SUCCESS! The element was processed and completed successfully. element ID: '{elementId}', CorrelationId: '{correlationId}'.", element.Id, element.CorrelationId);
// If we got this far, the item was processed and has no more occurrences.
// We can safely remove it from the queue.
await queueStore.DeleteAsync(element);
}
}
logger.LogTrace("FINISH! The task background finish running after '{seconds}' seconds.", clock.UtcNow.Subtract(startedAtUtc).TotalSeconds);
await queueStore.SaveChangesAsync();
} |
It may depend on what the processors are doing, but I assume they are scoped services using the current session/connection. That said, if you update many items and if they are indexed, there is a handler that creates indexing tasks, and in
|
But one connection per 100 items |
But it should be configured explicitly. |
@jtkech there is only 1 index that this process uses during update or select, so I don't think this is a problem. But we create a transaction and commit the code synchronously. I don't know where in OC we finally call |
Yes, we could have used |
Did you try the |
@jtkech I can't seem to figure that part out. I think it's because the app is running on Linux; the options for generating a memory dump are very limited.
For info there is also the Did you configure your Redis instance prefixes? |
@jtkech what settings should be configured? There are two settings related to Redis Configuration: am I missing anything here? |
|
@jtkech every webapp has different |
Okay, so you configured them. Hmm, is it only part of your connection string? We have a dedicated config field for this to differentiate the keys while still using the same Redis instance, if that is what you want. |
In one of my webapps, I am using ElasticSearch for every tenant but I did not enable
It would be nice to be able to specify a different database per app. Like for app A use database 1, for app B use database 2, so that we can use multiple databases instead of just an index. I think Redis has a limit of 16 databases available by default. But it would still be nice to specify the database. I configured the
|
maybe I can specify https://stackexchange.github.io/StackExchange.Redis/Configuration.html |
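For reference, a small sketch of what that could look like with StackExchange.Redis; per the configuration docs linked above, the connection string accepts a defaultDatabase option (the host name and database numbers here are placeholders):
using StackExchange.Redis;
public static class RedisDatabaseSketch
{
    public static ConnectionMultiplexer ConnectAppA()
        // "defaultDatabase" selects one of the (by default 16) logical databases.
        => ConnectionMultiplexer.Connect("myredis.redis.cache.windows.net:6380,ssl=true,defaultDatabase=1");
    public static ConnectionMultiplexer ConnectAppB()
    {
        // Equivalent options form.
        var options = ConfigurationOptions.Parse("myredis.redis.cache.windows.net:6380,ssl=true");
        options.DefaultDatabase = 2;
        return ConnectionMultiplexer.Connect(options);
    }
}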
Okay cool then. About multiple databases, yes, we discussed this but I don't remember the details ;) |
FYI, enabling private endpoints on Redis, KeyVault, and Azure Blob will also resolve any SNAT issues you're having. Oh, and this probably requires your App Service to be using a vnet? I forget, as someone else set it up for us. |
@ShaneCourtrille so are you suggesting that this is an Azure setup issue and not related to code? |
@sebastienros @jtkech I collected more memory dumps. Microsoft support came back with some interesting findings. First finding,
The thread pool at the time the memory dump was collected was at 14% CPU
In the finalizer queue there were 3995 objects of
Now analyzing what is on the Heap
The Gen 2 is using The current Azure plan is set to 3.5GB, 1 Core, P1v2.
In summary, the following is causing this issue: 1- The web app is suffering high memory usage due to private bytes. Question: what could be creating the What is creating and managing |
Not so obvious to read, but I can see a lot of And when an Hmm, are you registering an
|
Kestrel. This is standard thread-pool work, I don't see anything to worry about.
Session invokes Same comment for I see But it's possible that we don't use it as intended. @jtkech any idea what could use it? I could think about DI (as it might generate lambdas dynamically to resolve services in constructors).
It's probably your code that allocates too much, and the GC is overwhelmed. Process in batches, and reuse memory and data structures as much as possible. |
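As an illustration of the batching advice, a hypothetical sketch of paging through a large YesSql query instead of materializing the whole date range at once (the ReportItemIndex, its fields, and the page size are made up for the example):
using System;
using System.Linq;
using System.Threading.Tasks;
using OrchardCore.ContentManagement;
using YesSql;
public class ReportBatchingSketch
{
    // Hypothetical index over the content items the report aggregates.
    public class ReportItemIndex : YesSql.Indexes.MapIndex
    {
        public DateTime CreatedUtc { get; set; }
    }
    public static async Task<int> CountAsync(ISession session, DateTime fromUtc, DateTime toUtc)
    {
        const int pageSize = 500; // made-up batch size
        var skip = 0;
        var total = 0;
        while (true)
        {
            var page = (await session
                .Query<ContentItem, ReportItemIndex>(i => i.CreatedUtc >= fromUtc && i.CreatedUtc < toUtc)
                .Skip(skip)
                .Take(pageSize)
                .ListAsync())
                .ToList();
            if (page.Count == 0)
            {
                break;
            }
            // Aggregate this page only (a real report would sum its fields here),
            // then let it become collectible before loading the next one.
            total += page.Count;
            skip += page.Count;
        }
        return total;
    }
}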
I will think about it, maybe when using reflection around the dynamic |
Oh yes, I missed in the list |
Or to resolve query delegates not known at compile time. |
What makes you say it's between 1.4 and 1.5? |
It has happened since we upgraded the customer application from 1.4 to 1.5, but I'm not fully sure it is related to the upgrade. I will investigate. |
Things I noticed are using IL compilation:
What I saw in |
Thanks for the info, we don't use AuditTrail, I will look at it asap. |
I can repro with a simple blog recipe by hitting the about page. Took a memory dump, analyzed in perfview. What version of dotnet did we migrate to in 1.5? Could be some changes in DI. |
1.5 uses net6.0, same as 1.4. |
This makes me think about Lines 140 to 141 in a5395f4
Lines 225 to 233 in a5395f4
|
Not sure this is generating IL. This is CreateDelegate from a MethodInfo, not a DynamicMethodInfo. |
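To illustrate the distinction being made (a generic example, not Orchard Core's actual code): Delegate.CreateDelegate over an existing MethodInfo just binds to already-compiled code, whereas DynamicMethod plus ILGenerator emits new IL that the JIT then has to compile.
using System;
using System.Reflection;
using System.Reflection.Emit;
public static class DelegateVsDynamicMethod
{
    public static int Double(int x) => x * 2;
    public static void Demo()
    {
        // CreateDelegate over an existing MethodInfo: no new IL is generated,
        // the delegate simply points at the already-compiled method.
        MethodInfo method = typeof(DelegateVsDynamicMethod).GetMethod(nameof(Double));
        var bound = (Func<int, int>)method.CreateDelegate(typeof(Func<int, int>));
        Console.WriteLine(bound(21)); // 42
        // DynamicMethod + ILGenerator: new IL is emitted at runtime and JIT-compiled,
        // which is the kind of "IL compilation" the thread is hunting for.
        var dyn = new DynamicMethod("DoubleDyn", typeof(int), new[] { typeof(int) });
        ILGenerator il = dyn.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldc_I4_2);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Ret);
        var emitted = (Func<int, int>)dyn.CreateDelegate(typeof(Func<int, int>));
        Console.WriteLine(emitted(21)); // 42
    }
}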
Yes, it was just a thought before having time to look at it more in depth. |
I double checked the finalizers for Session, and they are all disposed correctly. When using For |
Okay, thanks for the info. Yes, when I do a dump from PerfView I can force the GC, but this one was not done by me. |
@jtkech I missed your comment earlier about handlers. Here is how I register my http clients: services.AddScoped<SendbirdApplicationService>()
.AddHttpClient<SendbirdApplicationService>((serviceProvider, client) =>
{
var options = serviceProvider.GetRequiredService<IOptions<SendbirdOptions>>().Value;
client.BaseAddress = new Uri("https://gate.sendbird.com/api/v2/", UriKind.Absolute);
client.DefaultRequestHeaders.TryAddWithoutValidation("SENDBIRDORGANIZATIONAPITOKEN", options.OrganizationApiKey);
}).AddTransientHttpErrorPolicy(policy => policy.WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(0.5 * attempt))); What's strange is that the AddHttpClient callback is called multiple times on every request, even when I am not requesting or using |
Because |
Yes, but it is called even when it is not injected. It seems to be called on every request even when no service is using or injecting it. I'll check again tomorrow and provide more details after trying it again. |
Okay, let me know, also check if not resolved by |
@jtkech that service was injected in different handlers like I changed it so these services are only resolved on demand directly from Update: I changed the code so it resolves |
Yes, using a typed client is like using the client factory, but everything is done automatically when the tied service is resolved. It is useful, for example, for a controller whose actions all use a given But this may not be the case when resolved from a handler or filter; in that case it's better to use the http client factory or to lazily resolve the tied service, which in the end will do the same things.
Yes, the problem is not the |
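A minimal sketch of the two alternatives mentioned above, reusing the SendbirdApplicationService registration from earlier in the thread (the handler classes themselves are hypothetical):
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
// Alternative 1: inject IHttpClientFactory and only create the client when needed,
// so nothing HTTP-related is built just because the handler is instantiated.
public class SomeContentHandler
{
    private readonly IHttpClientFactory _httpClientFactory;
    public SomeContentHandler(IHttpClientFactory httpClientFactory)
        => _httpClientFactory = httpClientFactory;
    public void DoWork()
    {
        var client = _httpClientFactory.CreateClient(nameof(SendbirdApplicationService));
        // ... use the client only on the code paths that actually need it ...
    }
}
// Alternative 2: resolve the typed-client service lazily from the service provider,
// so its AddHttpClient configuration callback only runs on demand.
public class SomeOtherHandler
{
    private readonly IServiceProvider _serviceProvider;
    public SomeOtherHandler(IServiceProvider serviceProvider)
        => _serviceProvider = serviceProvider;
    public void DoWork()
    {
        var sendbird = _serviceProvider.GetRequiredService<SendbirdApplicationService>();
        // ... call sendbird only when this code path is hit ...
    }
}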
Any news here since then? |
I have 3 different Azure WebApps hosted in Azure. It may not matter, but I deploy my apps using docker containers. All of my apps are connected to Azure SQL Server instances (each app uses a different database), the same Shell blob storage, the same Media blob storage, and the same Redis instance.
One site ran a report that pulls lots of content items, then does an aggregation to generate the final report. But this report caused all 3 WebApps to go down at the same time! Yes, all 3 different WebApps went down. Not fun!
After lots of investigation, the Microsoft team concluded that all of my web apps were victims of SNAT PORT EXHAUSTION. Here is more info about the issue and possible fixes.
What is SNAT?
SNAT means Source Network Address Translation. As you may already know, the Outbound/Inbound IP addresses correspond to the IP addresses of the public front end of your app's scale unit.
Whenever your application creates an outbound connection, SNAT will rewrite the source IP from the backend instance's IP address to the outbound public IP of the scale unit's frontend.
For example: If my app creates a connection to a SQL Database through public traffic, the SQL Server will see the frontend's outbound IP address as the client IP.
Each scale unit (or, as some call it, a stamp) has more than 65000 ports available for all of the instances that are in the same public stamp. Since this is a multi-tenant stamp, multiple users can be in the same stamp.
(Each user with their own instances allocated to themselves if they have a pricing tier of Basic or more).
Remember that you always have at least 128 ports preallocated to yourself, but this is not a limit and can be exceeded.
For a TCP connection to reuse an SNAT port, the 5-tuple (TCP (or UDP), Source IP, Source Port, Destination IP, and Destination Port) has to remain unique.
Changing any of the destination information will of course break the connection, and the source IP is the frontend's outbound IP address, which we cannot change.
This means that we can only change the destination port:
If the destination is different, we reuse the SNAT port.
If the whole tuple is the same, a new port will be used.
For example, if an app opens one connection per second for 1000 seconds to the same destination, it will use 1000 ports. On the other hand, if you open multiple connections to different destinations, some of those connections will reuse the same port.
Based on your app's performance, you consumed over 474 SNAT ports of the 576 allocated on WebApp1,
and over 474 SNAT ports of the 576 allocated on WebApp2.
Both with the same parameters.
Also, if an app opens a connection to a database and the database server fails to send an ACK response back within 230 seconds, the connection will remain idle for 230 seconds, causing other connections to remain in the queue, which could possibly lead to SNAT port exhaustion.
It is also important to know that whenever a client/server sends a FIN control flag, the TCP connection will remain in a TIME_WAIT state for 120 seconds and then free the SNAT port. In other words, when the client or server acknowledges closing the connection, there is a 120-second 'cooldown' before the SNAT port is freed.
Note: Remember that multiple HTTP Requests can go on the same TCP connection.
What causes SNAT Port Exhaustion?
How can we prevent SNAT Port Exhaustion?
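On the code side, independent of the Azure-side options (VNet integration, private endpoints, a NAT gateway, or service endpoints), the usual mitigation is to reuse pooled outbound connections instead of opening a new socket per call; a minimal sketch, with illustrative client name and lifetime values:
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
public static class OutboundHttpSetup
{
    public static void Configure(IServiceCollection services)
    {
        // Reuse pooled connections instead of opening a new socket (and SNAT port)
        // per request; recycle handlers periodically so DNS changes are picked up.
        services.AddHttpClient("reports")
            .SetHandlerLifetime(TimeSpan.FromMinutes(5))
            .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
            {
                PooledConnectionLifetime = TimeSpan.FromMinutes(5),
                PooledConnectionIdleTimeout = TimeSpan.FromMinutes(1),
            });
    }
}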
@jtkech I am using nightly previews in production. I am guessing this issue has to do with the recent improvements that you did around IHttpClientFactory.
@rjpowers10 @ShaneCourtrille @wAsnk have you tested the preview in your production environment after the latest changes to the HttpClientFactory?
@sebastienros this is a weird one and you may have additional info.