Chance of getting 409 while renewing the primary host lock #1864
Comments
For now the solution is to add a settings.job with:
It will not scale, but at least it will not fail.
@brettsam You said "acquires the lock just before the old host releases it" - how can that happen? The lease can only be held by one. Perhaps what you're saying is something like:
However, PrimaryHostCoordinator.AcquireOrRenewLeaseAsync seems to already handle this case correctly - if it doesn't have the lease and fails to acquire it, no error is thrown.
@mathewc, the lease can be held by multiple hosts inside a single instance because we use the same leaseId (instanceId) to acquire the lease. It's the flow @brettsam mentioned:
I see, thanks for clarifying. If the problem is that the renewal fails because we already have the lock, then can't we handle that condition by ignoring that specific error?
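To make that suggestion concrete, here is a minimal sketch (not the actual PrimaryHostCoordinator code) of what ignoring that specific renewal error could look like, written against the Azure.Storage.Blobs SDK; the class, method, and the decision to return false instead of throwing are assumptions for illustration.

```csharp
// Hypothetical sketch: tolerate the specific 409 that shows up when the lease
// was released/re-acquired by another host on the same instance, instead of
// surfacing it as an error.
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs.Specialized;

public static class LeaseRenewal
{
    public static async Task<bool> TryRenewAsync(BlobLeaseClient leaseClient)
    {
        try
        {
            await leaseClient.RenewAsync();
            return true;
        }
        catch (RequestFailedException ex)
            when (ex.Status == 409 && ex.ErrorCode == "LeaseIdMismatchWithLeaseOperation")
        {
            // This is the error code seen in the logs above. Treat it as benign
            // and let the next acquire attempt take the lease again.
            return false;
        }
    }
}
```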
Work has been merged in 1.x. Need to make sure this isn't an issue in 2.0.
Related to #2326
Overnight I saw this error a number of times, but it caused enough of a delay (5-10 mins!) in the consumption of Service Bus messages that it triggered several application-level errors for us, as processing these messages in a timely fashion is critical. Currently these are v2 functions with Service Bus triggers deployed to an App Service that is scaled out to 2-N instances. Paul
Disregard my previous comment - after further investigation, these errors were just coincidental with another network issue outside of Azure.
I'm also seeing this error. This is the most recent instance:
I'm running on Functions v2, and all of the functions inside this app are HTTP-triggered and short-lived. I agree with the other people in this thread: if this isn't an issue, it shouldn't be logged as an error. Errors are something that we need to look at and not simply ignore (IMO).
Idea! Maybe use structured logging properties for the
I have a web app running Azure Functions and I am getting this every time I restart the web app or swap slots. 2020-09-15T10:01:12Z [Error] Singleton lock renewal failed for blob '.../host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2020-09-15T10:00:47.566Z (24432 milliseconds ago) with a duration of 3 milliseconds. The lease period was 15000 milliseconds.
@wpitallo any other side effects? This may happen if the primary host changes and the lease is stolen by another instance.
Not that I am aware of at this stage.
Same issue here with an Azure (~3) timer-triggered function;
The last successful lock time on this is weird. It doesn't seem to cause problems, it just fills App Insights with sev 3 errors.
Hi all, I am facing the same issue in my Azure Function app. It seems like the real problem is that in non-prod environments we are using a single blob storage account shared by multiple Azure Function apps. When a renew-lease operation is performed by one of the apps, then, because multiple apps are using the same blob storage, the "Singleton lock renewal failed for blob" error occurs. This in turn leads to a restart of the job host, and the function stops abruptly without any function error. I can conclude this based on my investigation: our production environment does not share a single blob storage account across multiple function apps, and we have never faced the "Singleton lock renewal failed for blob" error in that environment. So, as per my understanding, the solution to this error is to avoid sharing a single blob storage account between multiple Azure Function apps. TIA
I can confirm that this happens in environments with individual storage accounts for each function app as well. I always create a new storage account for every new function app I create. Sharing storage accounts introduces other problems, which is why I wouldn't recommend ever sharing storage accounts between function apps 👍
I'll be following this, since we also see this problem with a consumption plan Azure Function App.
We receive this error every time our durable function executes (we are using dedicated storage for this function and it is running in an isolated App Service Environment). Unfortunately this appears to halt the function execution after 80 seconds. So we have a durable function that will run for ~80 seconds, then we get the "Singleton lock renewal failed" message in the log traces and no further logging from our application code until it retries the durable function activity 30 minutes later. It cycles like this indefinitely, running for ~80 seconds every 30 minutes but never completing. It doesn't seem to be directly related to our application logic, because within that 80-second period the application logic can get further along (e.g. if we bump the thread count), and if we run the durable activity logic directly on a local instance of the Docker container it runs to completion. So it seems to be time-bound, and it prevents our application code from executing despite no errors in the application logic itself. We are running on version ~4, but downgrading to ~3 didn't seem to help.
I started seeing this after updating (function) apps (version ~4 / .NET 6) and updating all of the dependencies. There are only timer triggers (multiple) and Service Bus triggers (multiple), and there is one storage account. On .NET 3.1 and some older library versions I hadn't seen this logging; perhaps it's just some noise, and I haven't spotted yet whether it affects function execution in any way. Request [xxxxxxxxxxx] PUT https://name.blob.core.windows.net/azure-webjobs-hosts/locks/fname/host?comp=lease Error response [xxxxxxxxxxxx] 409 There is already a lease present. (00.0s) Those are logs of severity "Warning" in Insights.
I also started seeing this after updating (function) apps (version ~4 / .NET 6) and updating all of the dependencies. A single Service Bus trigger, 1 function, 1 storage account. I didn't see this before updating to Functions v4 and .NET 6.
This was affecting my Azure Functions v4 apps with .NET 6 and deployment slots. What I ended up discovering was that one of my swap slots was using the same storage account connection string as my production slot. Once I changed this, the lease problem stopped appearing in Application Insights.
@Bpflugrad - that's good info, thanks!
I am also seeing this error, although I am using a single storage account for a single function app. The issue appears when scaling out to multiple instances. I have just one BlobTrigger function and two TimerTrigger functions. Is there a way to avoid this error or at least make it not log as an exception in App Insights?
Same for me; we discovered this after our Log Analytics costs increased (!). Like @Bpflugrad, I found I had slots sharing the same storage account, but even after fixing this the issue remains.
Adding on: we also get this issue with any trigger type at random intervals. We tried to scale up and out, but this did not provide a solution. It is happening on multiple function apps with no correlation. We were advised that this is just noise and to ignore it, but it still shows as an error in Insights, which is not ideal.
It happens in function apps (version ~3 / .NET Core 3.1) too.
Also seeing this error (409 with 'LeaseIdMismatchWithLeaseOperation') on one of our function apps (runtime version ~4). Is there no easy way to silence these errors if they don't really matter?
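For those asking how to keep these out of Application Insights: below is a minimal sketch of an ITelemetryProcessor that drops the "Singleton lock renewal failed" traces. The class name and the message match are assumptions, and whether host-generated telemetry actually flows through custom processors depends on the Functions version and hosting model, so verify it filters these entries (or alternatively tune the relevant log level in host.json) before relying on it.

```csharp
// Hypothetical sketch: an Application Insights telemetry processor that swallows
// the noisy "Singleton lock renewal failed" traces instead of sending them.
using System;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

public class SingletonLeaseNoiseFilter : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public SingletonLeaseNoiseFilter(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        // The renewal failure is logged as a trace; drop it and keep everything else.
        if (item is TraceTelemetry trace &&
            trace.Message != null &&
            trace.Message.StartsWith("Singleton lock renewal failed", StringComparison.Ordinal))
        {
            return;
        }

        _next.Process(item);
    }
}
```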
I see this event and then another thread starts processing the message. Any ideas how we can avoid the issue? Unless you code for idempotency, this will be a problem. For example, people will get invitation emails twice if the code doesn't exclude resending emails that were sent in the last minute. More details here: https://stackoverflow.com/questions/77070373/azure-functions-service-bus-trigger-settings-maxautolockrenewalduration-or-maxau
While investigating another issue I stumbled upon a lot of these 409s coming from renewing the primary host lock. I did some investigation and I believe that, because we do a 'fire and forget' with the host disposal, it's possible that a new host comes up and acquires the lock just before the old host releases it. That means that the next attempt to renew fails, as the lock has been released.
The error looks like:
Singleton lock renewal failed for blob 'locks/AppName/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2017-09-02T18:38:31.526Z (111006 milliseconds ago) with a duration of 57508 milliseconds. The lease period was 15000 milliseconds.
I think it's happening with something like the sequence below.
Note 1 -- this is all on a single instance, and the references to hosts are about the hosts being cycled via the ScriptHostManager.
Note 2 -- all of the host locks use the current machine id as the proposed lease id, which allows multiple hosts on the same machine to modify the lease.
Host1 -> acquire lock
Host1 -> renew lock
something triggers host recycle and creates a new host while disposing of the old host
Host2 -> acquire lock
Host1 -> release lock
Host2 -> renew lock -> throw
It's actually possible for several other combinations to happen (I've seen a race where Host1 renews right as it releases, also causing a throw).
Ultimately, these errors seem benign as they'll eventually recover and all will be well, but it causes a lot of errors in the logs (I see 100k+ 409 logs that involve the host lock over the last week).
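For illustration only, here is a minimal sketch of that sequence using the Azure.Storage.Blobs SDK. The blob, lease duration, and `proposedLeaseId` are placeholders (the real host derives the proposed lease id from the instance id); because both clients propose the same lease id, the second acquire succeeds while the first lease is still active, and the final renewal throws once the first client has released the lease.

```csharp
// Hypothetical repro of the race above: two "hosts" on one instance share the
// same proposed lease id, so the new host can acquire while the old host still
// holds the lease, and the old host's release breaks the new host's renewal.
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

public static class HostLockRaceDemo
{
    // proposedLeaseId: a GUID-format string shared by both hosts on the instance.
    public static async Task RunAsync(BlobClient hostLockBlob, string proposedLeaseId)
    {
        BlobLeaseClient host1 = hostLockBlob.GetBlobLeaseClient(proposedLeaseId);
        BlobLeaseClient host2 = hostLockBlob.GetBlobLeaseClient(proposedLeaseId);

        await host1.AcquireAsync(TimeSpan.FromSeconds(15)); // Host1 -> acquire lock
        await host1.RenewAsync();                           // Host1 -> renew lock

        // Host recycle: the new host acquires with the same proposed lease id,
        // which blob storage allows while the old lease is still active.
        await host2.AcquireAsync(TimeSpan.FromSeconds(15)); // Host2 -> acquire lock

        // Fire-and-forget disposal of the old host releases the shared lease.
        await host1.ReleaseAsync();                         // Host1 -> release lock

        await host2.RenewAsync();                           // Host2 -> renew lock -> throws 409
    }
}
```

The exact 409 sub-code on that last call can vary with timing, but the shape of the failure matches the LeaseIdMismatchWithLeaseOperation errors in the log message above.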