Durable Entities stay locked when critical section is left via exception #2743
Amazing bug report @ranasch, I've been facing the exact same issue and had just put together a stripped-down example when I came to post, so it's not a complete waste of time. Here is the basic example I came up with:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;
using Microsoft.DurableTask.Entities;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;

public class EntityLockExample
{
    [Function(nameof(EntityLockExample.EntityLockExampleTrigger))]
    public async Task EntityLockExampleTrigger(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = $"{nameof(EntityLockExample.EntityLockExampleTrigger)}/{{throwError}}")] HttpRequestData requestData,
        bool? throwError,
        [DurableClient] DurableTaskClient durableTaskClient)
    {
        await durableTaskClient.ScheduleNewOrchestrationInstanceAsync(new TaskName(nameof(this.EntityLockExampleOrchestration)), throwError.GetValueOrDefault(false));
    }

    [Function(nameof(EntityLockExample.EntityLockExampleOrchestration))]
    public async Task EntityLockExampleOrchestration([OrchestrationTrigger] TaskOrchestrationContext orchestrationContext, bool throwError)
    {
        EntityInstanceId entityLock = new EntityInstanceId(nameof(EntityLockExample), "myLock");
        await using (await orchestrationContext.Entities.LockEntitiesAsync(entityLock))
        {
            await orchestrationContext.CallActivityAsync(new TaskName(nameof(this.EntityLockExampleActivity)), throwError);
        }
    }

    [Function(nameof(EntityLockExample.EntityLockExampleActivity))]
    public void EntityLockExampleActivity([ActivityTrigger] bool throwError)
    {
        if (throwError)
        {
            throw new Exception("Activity Failed");
        }
    }
}
```

With this code you can run it as many times as you want passing throwError = false. I also tried to capture any exceptions inside the critical section and rethrow them outside the using block after the lock had been disposed, but sadly that did not help.

Another "workaround" I have at the moment is to never allow an orchestration that uses critical sections to enter a failed state. To do this I capture any exception in the activities and store it in a custom state. This is less than ideal though, as in DFMonitor it looks like everything is running fine when really there may be many issues.

I have read some other tickets related to this issue where people were asking about the Netherite storage engine, which I hoped would solve this, but sadly I can't get it to work with .NET 8 Isolated. Might be uncouth, but here is a link to that issue in the Netherite repo.

Thanks @sebastianburckhardt for already assigning this to yourself; if you have any other ideas or workarounds, that would be ace.
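For reference, here is a minimal sketch of that "never let the orchestration fail" workaround, reusing the usings and types from the sample above. The function name and the custom-status shape are illustrative placeholders, not the code I actually run:

```csharp
// Hypothetical sketch: catch activity failures inside the critical section and
// record them in custom status, so the orchestration completes normally and the
// entity lock is released when the critical section is disposed.
[Function(nameof(GuardedOrchestration))]
public async Task GuardedOrchestration([OrchestrationTrigger] TaskOrchestrationContext orchestrationContext, bool throwError)
{
    EntityInstanceId entityLock = new EntityInstanceId(nameof(EntityLockExample), "myLock");
    await using (await orchestrationContext.Entities.LockEntitiesAsync(entityLock))
    {
        try
        {
            await orchestrationContext.CallActivityAsync(new TaskName(nameof(this.EntityLockExampleActivity)), throwError);
        }
        catch (TaskFailedException ex)
        {
            // Surface the failure through custom status instead of failing the
            // orchestration, which would leave the entity lock held.
            orchestrationContext.SetCustomStatus(new { Failed = true, ex.Message });
        }
    }
}
```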
I have been able to confirm that the current implementation does not properly release the lock when an exception is thrown inside a critical section. I have not yet been able to determine why though.
The problem seems to be in how the failure result is constructed after the critical section fails. I think the solution is, if there is a failure, to first try to properly use the original execution result (if that execution result is also a failure), and only construct a new failure result if the original execution result is not a failure.
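A rough illustration of that idea (the types and names below are placeholders, not the actual DurableTask internals):

```csharp
using System;

// Placeholder type standing in for the worker's internal execution result.
public sealed record ExecutionResult(bool IsFailure, Exception? Error)
{
    public static ExecutionResult FromException(Exception e) => new(true, e);
}

public static class CriticalSectionFailureHandling
{
    // If the original execution already produced a failure, reuse it so whatever
    // it recorded is kept; only build a new failure result when the original
    // execution did not fail.
    public static ExecutionResult Resolve(ExecutionResult original, Exception criticalSectionError)
    {
        if (original.IsFailure)
        {
            return original;
        }

        return ExecutionResult.FromException(criticalSectionError);
    }
}
```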
PR #2748 contains a fix for this issue and (barring any complications) can go out with the next release.
I can confirm that with the latest release on NuGet this issue is resolved. Thanks @sebastianburckhardt for working on it. Correction: sadly this issue is not resolved after all; I must have tested it incorrectly...
@sebastianburckhardt @DEndersby91 Not sure how you confirmed it. When I upgraded Microsoft.Azure.Functions.Worker.Extensions.DurableTask from 1.1.1 to 1.1.3 in my sample above, I don't get a lock anymore. This call does not return:
What am I missing?
You've made me worried now. I tested the code from my original comment above and it no longer locks up for me after an error. Let me add some delays inside the activity and queue a few up and see if the lock is taken out. Are you testing with an existing hub that already has a locked-up entity? I don't think this fix will correct things historically, so some cleaning up might be required.
No, testing locally with Azurite; I tried with an existing hub as well as discarding and creating a new hub from scratch. Same result.
You're 100% right @ranasch, I don't know what I was testing before, but this issue is sadly not resolved with the current 1.1.2 release. OK, so reading other issues, this might be caused by entities not getting picked up: #2830
1.1.4 seems to fix this for me
This is happening to me in 1.1.4. I'm occasionally getting errors like this
When this happens, the orchestration fails, and entity locks are not released without manual intervention to restart the orchestration.
Description
Running a .NET 8 isolated durable function that has a critical section guarded by a durable entity. When the orchestration fails, the entity stays locked and no further orchestrations can proceed.
Expected behavior
When the critical section (using) is left via an exception, the built-in dispose should release the orchestration's lock on the entity; otherwise it stays locked forever and no other orchestration can obtain a lock.
Actual behavior
When the critical section is left via an exception, the lock on the entity is not released, because the release is not persisted in the Azure Table.
Relevant source code snippets
The created entity will stay locked by the critical section unless you also implement a custom Dispose which, by calling the entity a final time, ensures the unlock state is persisted in table storage:
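Roughly, such a wrapper could look like the following (a simplified sketch of the idea, not our actual production code; the class name and the extra "touch" operation are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Entities;

// Hypothetical sketch: wrap the handle returned by LockEntitiesAsync and, on
// dispose, call the entity one final time so the unlock state gets persisted.
public sealed class PersistingLockScope : IAsyncDisposable
{
    private readonly TaskOrchestrationContext context;
    private readonly EntityInstanceId entityId;
    private readonly IAsyncDisposable innerLock;

    public PersistingLockScope(TaskOrchestrationContext context, EntityInstanceId entityId, IAsyncDisposable innerLock)
    {
        this.context = context;
        this.entityId = entityId;
        this.innerLock = innerLock;
    }

    public async ValueTask DisposeAsync()
    {
        // Release the critical section first.
        await this.innerLock.DisposeAsync();

        // Call the entity once more ("touch" is a placeholder operation name)
        // to force the release to be written back to table storage.
        await this.context.Entities.CallEntityAsync(this.entityId, "touch");
    }
}
```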
Unfortunately this approach has a catch: it throws an ambiguous workflow exception, so it's not the solution.
Known workarounds
Tried the custom dispose, however that throws this exception:
App Details
Project:
settings:
Screenshots
Lock before disposing:
(screenshots omitted)
Entity unlocked after custom callout in dispose:
If deployed to Azure
This is a simplified excerpt of our production function. For production details reach out to me directly to get a repro.