Azure SQL DTC bug #2970

Open
Abrissirba opened this issue Nov 4, 2024 · 7 comments · May be fixed by #3019
Labels
🐛 Bug! Issues that are bugs in the drivers we maintain. ✔️ Triage Done Issues that are triaged by dev team and are in investigation.

Comments

Abrissirba commented Nov 4, 2024

We have an issue that has scared us and taken us some time to understand.
I am not sure whether this issue belongs to SqlClient, System.Transactions, or Azure SQL.

The bug first appeared when we moved our on-prem solution to Azure:
going from Windows to Azure Container Apps running on Linux, and from on-prem SQL Server to an Azure SQL database.

Our code uses TransactionScope to make sure everything is committed in one batch.

A typical flow is that at the beginning of execution a TransactionScope is created, a record is inserted into a table (using ADO.NET), and the flow then continues with the business logic. (This is the outbox with TransactionScope in NServiceBus.)

The business logic code uses Entity Framework 6 to communicate with the DB.

At the end of execution the first connection will try to update the outbox record it created and complete the TransactionScope.
Since we have two connections open simultaneously, the transaction will escalate to DTC.
From my understanding, Azure SQL handles this DTC transaction as something called an elastic transaction.
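
For reference, a minimal sketch of this flow is below. The Outbox table, its columns, and the connection string are made up for illustration; in the real application the first connection is owned by NServiceBus and the business logic goes through EF6, but plain ADO.NET on both connections shows the same enlistment and escalation behaviour.

using Microsoft.Data.SqlClient;
using System.Transactions;

// Hypothetical connection string and outbox schema, for illustration only.
var connectionString = "Server=<server>;Database=<DB-name>;User Id=<user>;Password=<password>;";
var messageId = Guid.NewGuid();

// On .NET 7+ this must be enabled before the transaction is allowed to escalate to distributed.
TransactionManager.ImplicitDistributedTransactions = true;

using var scope = new TransactionScope(
	TransactionScopeOption.RequiresNew,
	new TransactionOptions { IsolationLevel = System.Transactions.IsolationLevel.Serializable, Timeout = TimeSpan.Zero },
	TransactionScopeAsyncFlowOption.Enabled);

// Connection 1: insert the outbox record and keep the connection open (framework-owned in our case).
using var outboxConn = new SqlConnection(connectionString);
outboxConn.Open(); // enlists in the ambient transaction -> is_local = 1 on the server

var insert = outboxConn.CreateCommand();
insert.CommandText = "insert into Outbox (MessageId, Dispatched) values (@id, 0)";
insert.Parameters.AddWithValue("@id", messageId);
_ = await insert.ExecuteNonQueryAsync();

// Connection 2: the business logic (EF6 in the real application). Opening a second connection
// inside the same scope escalates the transaction to a distributed (elastic) transaction.
using (var businessConn = new SqlConnection(connectionString))
{
	businessConn.Open(); // enlists -> is_enlisted = 1 on the server
	var work = businessConn.CreateCommand();
	work.CommandText = "update Countries set Name = 'BusinessLogic' where Id = 1";
	_ = await work.ExecuteNonQueryAsync();
}

// Back on connection 1: mark the outbox record as dispatched and complete the scope.
var update = outboxConn.CreateCommand();
update.CommandText = "update Outbox set Dispatched = 1 where MessageId = @id";
update.Parameters.AddWithValue("@id", messageId);
_ = await update.ExecuteNonQueryAsync();

scope.Complete();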

During our first tests we had a low SKU on the database, and some parts of our DB communication (heavy inserts) took a very long time to execute.
At most it could take up to 40 minutes.

During these long-running transactions we got an error at the end when updating the outbox record, which led to the transaction being rolled back.
So far so good.

What surprised us was that we could see data in the database from this execution that should have been rolled back but instead had been committed.
Scary and confusing!

After investigating the error, which indicated a broken connection, we found out that we were using the Proxy connection policy in Azure.
This connection policy seems to close connections that have been idle for 30 minutes.
Redirect is the policy recommended by Microsoft, but Proxy is the default in many scenarios, including when you use private endpoints, which is the case for us.
After changing to Redirect, and increasing the SKU so that the inserts don't take as long to execute, we have not seen this behaviour any more.

However, we could not sleep as well as we would like knowing that this error might still occur and end up corrupting the database.

So I started digging into the code to find a way to recreate this.
This is what I have found so far:

  • When the first connection inserts a record into the DB, a transaction is created in the DB.
  • When EF opens a connection, a second transaction is created and enlisted in the transaction created by connection 1.
  • There are now two transactions in the DB: one with is_local = 1 and one with is_enlisted = 1.
  • If the session that holds the is_local transaction dies for some reason (in our case by being idle for too long)
    AND
    the second connection is closed and opened again, the transaction just disappears from the DB and the connection starts inserting into the DB and committing immediately.

This only happens when we use an Azure SQL database. If we use a local DB on Windows, which escalates using MSDTC, an error is thrown when the first connection/session dies.

I have been able to reproduce this with the example below, using the latest versions of both Microsoft.Data.SqlClient and System.Data.SqlClient, against an Azure SQL DB.

The code below can be used to recreate the issue.

using Microsoft.Data.SqlClient;
using System.Transactions;
// The SQL below should be executed in SSMS or something similar to get information about the connections/transactions.
// The SELECT query will return the current active transactions.
// The session holding the transaction with is_local = 1 should then be killed to simulate a connection error on the connection that holds the DTC coordination.
//
//SELECT
//    st.session_id,
//    st.transaction_id,
//    st.is_local,
//    st.is_enlisted,
//    s.client_interface_name
//FROM sys.dm_tran_session_transactions st
//JOIN sys.dm_exec_sessions s ON st.session_id = s.session_id
//WHERE st.transaction_id IN
//    (SELECT transaction_id FROM sys.dm_tran_active_transactions WHERE name = 'user_transaction')

//kill <id of the session that has is_local set to 1>

var options = new TransactionOptions
{
	IsolationLevel = System.Transactions.IsolationLevel.Serializable,
	Timeout = TimeSpan.Zero // TimeSpan.Zero is default of `TransactionOptions.Timeout`
};
var scope = new TransactionScope(TransactionScopeOption.RequiresNew, options, TransactionScopeAsyncFlowOption.Enabled);
TransactionManager.ImplicitDistributedTransactions = true;

var connectionString = "Server=.;Database=<DB-name>;Trusted_Connection=True;TrustServerCertificate=True";

var name = "Test8";

// First connection, this will have the transaction as local
{
	var conn1 = new SqlConnection(connectionString);
	conn1.Open();
	var cmd = conn1.CreateCommand();
	cmd.CommandText = $"update Countries set Name = '{name}' where Id = 1";
	_ = await cmd.ExecuteNonQueryAsync();
	//conn1.Close(); // Adding this line would prevent the transaction from being escalated to distributed.
					 // In our case we cannot do this since this connection is handled by third-party code.
}

// Second connection, this will escalate the transaction to distributed
// This should fail if the transaction dies
{
	var conn2 = new SqlConnection(connectionString);
	conn2.Open();
	{
		// If this block is removed/commented out, an error will be thrown (as expected).
		// It seems like the Close/Open somehow breaks out of the TransactionScope.
		// Microsoft.Data.SqlClient.SqlException - Distributed transaction completed. Either enlist this session in a new transaction or the NULL transaction.
		conn2.Close();
		conn2.Open();
	}
	
	// KILL THE SESSION THAT HOLDS THE is_local TRANSACTION HERE
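	// (Not part of the original flow: to get time to run the KILL statement in SSMS, pause here with a debugger breakpoint or an added Console.ReadLine().)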
	var cmd2 = conn2.CreateCommand();
	cmd2.CommandText = $"update Countries set Name = '{name}' where Id = 2";
	_ = await cmd2.ExecuteNonQueryAsync();
}

scope.Complete();
scope.Dispose();
@cheenamalhotra cheenamalhotra added the 🆕 Triage Needed For new issues, not triaged yet. label Nov 5, 2024
@samsharma2700 samsharma2700 added 🐛 Bug! Issues that are bugs in the drivers we maintain. ✔️ Triage Done Issues that are triaged by dev team and are in investigation. and removed 🆕 Triage Needed For new issues, not triaged yet. labels Nov 5, 2024
@mdaigle mdaigle self-assigned this Nov 12, 2024

mdaigle commented Nov 12, 2024

Hi, @Abrissirba, I'm taking a look at this and will keep you updated on my findings.
In the meantime, can you please let me know which version of Microsoft.Data.SqlClient you're using?

Thanks!

@Abrissirba (Author)

I used 5.2.2 of Microsoft.Data.SqlClient


mdaigle commented Nov 14, 2024

Hi @Abrissirba, I wanted to share where I'm at to keep you updated. Your repro code helped me verify the issue (thank you!) and I've determined a root cause. I'm currently evaluating potential solutions and will share again when we're ready to go forward with a fix.

@Abrissirba (Author)

Thanks for the update and happy that my repro code helped.

@cheenamalhotra cheenamalhotra added this to the 6.0-preview3 milestone Nov 15, 2024

Abrissirba commented Nov 25, 2024

Any updates on this? I see that there is a PR which is great! 🥇 But the milestone 7.0 is making me a bit sad :(


mdaigle commented Nov 25, 2024

I'd hoped to get this in for the 6.0 release, but sadly it still has work pending and it's landing too close to the cutoff date. We've discussed doing a hotfix for this change, so it should come shortly. Thanks for your patience!

@Abrissirba (Author)

Thanks, and I do understand. From my perspective this feels like a hotfix so it brings me back to happiness to see that you have already discussed that :D
