New install fails to start services due to Win32Exception in Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint #52

Open
crowbar27 opened this issue Oct 18, 2024 · 2 comments

Comments

@crowbar27

Problem Description

I was upgrading an on-premises cluster from HPC Pack 2016 to 2019. The installation completed successfully, but I cannot connect to the HPC Cluster Manager because the HPC services are constantly crashing.

Steps to Reproduce

  • I uninstalled all 2016 components on all three head nodes.
  • I created a new remote SQL Server instance.
  • I populated the database using the script provided with the installer, which ran without error.
  • I ran the HPC Pack installer (first with built-in HA; later I retried with a single node, but the issue is always the same). The installer reported no errors or warnings in the prerequisite check and ran without error. All installer logs have an exit code of 0 at the end.
  • At the end, the installer asks whether I want to connect to the cluster manager. When I do so, the connection times out.

Expected Results

The connection succeeds.

Actual Results

The connection times out, because the HPC services are constantly restarting. Most importantly, it seems that the scheduler cannot start which causes all other services to fail, too. In the event log, I find entries like

The HPC Diagnostics Service service terminated unexpectedly. It has done this 118 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.

Note that the diagnostics service is just an example; there are similar entries for other services, including the scheduler.

Immediately before that, it logs event 1000 with details:

Faulting application name: HpcDiagnostics.exe, version: 6.2.7756.0, time stamp: 0x65133346
Faulting module name: KERNELBASE.dll, version: 10.0.14393.7426, time stamp: 0x66f60177
Exception code: 0xe0434352
Fault offset: 0x0000000000026ea8
Faulting process id: 0x2354
Faulting application start time: 0x01db2181b5a6aaae
Faulting application path: C:\Program Files\Microsoft HPC Pack 2019\Bin\HpcDiagnostics.exe
Faulting module path: C:\windows\System32\KERNELBASE.dll
Report Id: cc5f5cc1-938b-4b0c-949a-e806b4a8cc6f
Faulting package full name:
Faulting package-relative application ID:

and before that I get event 1026 from the .NET Runtime:

Application: HpcDiagnostics.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ComponentModel.Win32Exception
at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(System.String, System.String)
at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+d__23.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+d__18.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.Diagnostics.Store.DiagnosticsStore+d__27.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.Diagnostics.DiagnosticsSvc+d__8.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.DiagnosticsWinService.DiagnosticsWinService+<b__2_1>d.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.MembershipDisabled+d__0.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.DiagnosticsWinService.DiagnosticsWinService+<>c__DisplayClass2_0+<b__0>d.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at System.Threading.ThreadPoolWorkQueue.Dispatch()

Something seems to work, though, because I can see changes being made to the database. Most notably, with the built-in HA install, data was written to HPCHAWitness.

Additional Logs

Using hpctrace, I found that the scheduler is in a loop of:

17:45:04.933 i HpcScheduler 8428 5548 The HPC job scheduler started.
10/15/2024 17:45:05.120 e HpcScheduler 8428 5548 [[ServiceCore].StartSvc] .Exception detail: System.ComponentModel.Win32Exception (0x80004005): Failed to set registry checkpoint on service (error: 0).. at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(String serviceName, String RegPath).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d__29.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d__24.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode, Boolean schedulerOnAzure, String clusterName, String sqlString, String builtInAdmin, String builtInAdminPass, Func2 azureUserPasswordDecryptor, Func2 azureUserPasswordEncryptor).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d__19.MoveNext().Current stack: at Microsoft.Hpc.Scheduler.SchedulerTracingUtil.GenMessageFormat(String message, Object[] args, String e, String& newMessage, Object[]& newArgs).. at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Int32 jobId, Int32 taskId, Int32[] resourceId, String nodeName, Exception e, TraceEventType level, String message, Object[] args).. at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception e, String message, Object[] args).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d__19.MoveNext().. 
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[TStateMachine](TStateMachine& stateMachine).. at Microsoft.Hpc.Scheduler.SchedulerSvc.StartSvc(IHpcContext context).. at Microsoft.Hpc.Scheduler.SchedulerService.<b__5_1>d.MoveNext().. at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[TStateMachine](TStateMachine& stateMachine).. at Microsoft.Hpc.Scheduler.SchedulerService.b__5_1().. at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness.<>c__DisplayClass45_0.b__0(Object _).. at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem().. at System.Threading.ThreadPoolWorkQueue.Dispatch()..
10/15/2024 17:45:05.136 e HpcTrace 8428 5548 Current Application Domain UnhandledException event invoked: System.ComponentModel.Win32Exception (0x80004005): Failed to set registry checkpoint on service (error: 0).. at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(String serviceName, String RegPath).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d__29.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d__24.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode, Boolean schedulerOnAzure, String clusterName, String sqlString, String builtInAdmin, String builtInAdminPass, Func2 azureUserPasswordDecryptor, Func2 azureUserPasswordEncryptor).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d__19.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerService.<b__5_1>d.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. 
at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness.<>c__DisplayClass45_0.b__0(Object _).. at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem().. at System.Threading.ThreadPoolWorkQueue.Dispatch()
10/15/2024 17:45:05.152 i HpcTrace 8428 5548 Cosmos Logger is being closed
10/15/2024 17:46:38.471 w Microsoft.Hpc.HighAvailablity.Algorithm 12556 15320 [2024-10-15T17:46:38.3615810Z][Protocol][06018788-c18f-4aeb-a708-7d8e85d51f2e] Primary down
10/15/2024 17:46:40.549 i HpcScheduler.exe 12556 10052 [GetCertificateValidationCallback] Bypass certificate CN validation.
10/15/2024 17:46:40.549 i HpcScheduler.exe 12556 10052 [GetCertificateValidationCallback] Bypass certificate CN validation.

Additional Comments

Because the call stack contains crypto-related frames, I first suspected an issue with the certificate, but the failure occurs both with a certificate from our AD-integrated CA and with one created using the script provided with the installer. Furthermore, GetKeyAndSalt rather suggests an issue with a symmetric encryption algorithm, but I am not aware of any setting I could change in that respect.

@crowbar27
Author

I was able to attach a remote debugger to the scheduler before it crashed and it seems that some obsolete code for FCM is running here:

[Obsolete("HAUtils is a utility class for failover cluster, do not use it anymore")]
[PermissionSet(SecurityAction.Demand, Name = "FullTrust")]
public class HAUtils

// ...

	public unsafe static void SetGenericServiceRegistryCheckpoint(string serviceName, string RegPath)
	{
		//IL_0014: Expected I8, but got I
		//IL_0026: Expected I8, but got I
		//IL_0048: Expected I, but got I8
		//IL_005c: Expected I, but got I8
		//IL_00d5: Expected I, but got I8
		//IL_0147: Expected I, but got I8
		//IL_0164: Expected I, but got I8
//...

Is there any way to prevent this?

@crowbar27
Author

My understanding from the disassembly is that

	public async Task InitKeyAndSalt()
	{
		if (useCache && encryptKey != null)
		{
			return;
		}
		CancellationToken token = HpcContext.Get().CancellationToken;
		string text = await HpcContext.Get().Registry.GetValueAsync<string>("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", keyLocation, token);
		if (string.IsNullOrEmpty(text))
		{
			using (CreateEncryptor())
			{
			}
			await HpcContext.Get().Registry.SetValueAsync("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", keyLocation, Convert.ToBase64String(encryptKey), token);
			await HpcContext.Get().Registry.SetValueAsync("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", initVectorLocation, Convert.ToBase64String(initVector), token);
		}
		else
		{
			encryptKey = Convert.FromBase64String(text);
			initVector = Convert.FromBase64String(await HpcContext.Get().Registry.GetValueAsync<string>("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", initVectorLocation, token));
		}
		if (encryptKey == null)
		{
			throw new InvalidOperationException();
		}
		if (!HAUtils.IsHeadNodeHAClustered())
		{
			return;
		}
		try
		{
			HAUtils.SetGenericServiceRegistryCheckpoint("HpcScheduler", "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security");
		}
		catch (ApplicationException ex)
		{
			if (((Win32Exception)ex.InnerException).NativeErrorCode != -2147024713)
			{
				throw;
			}
		}
	}

prepares a cryptographic key that is stored in the registry and, provided the head node is in an FCM cluster, creates a registry checkpoint to make sure that all clustered nodes have the same registry data. However, this fails because HPC Pack no longer uses FCM.
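A side note on the catch block in the decompiled method, offered as a sketch rather than a confirmed diagnosis. The tolerated value -2147024713 is 0x800700B7, which is the HRESULT form of the Win32 error ERROR_ALREADY_EXISTS (183). In addition, System.ComponentModel.Win32Exception does not derive from ApplicationException, so a Win32Exception thrown directly, as the hpctrace output suggests, would bypass that handler entirely. The names CheckpointErrorDemo, Win32CodeFromHResult, IsWin32Facility and Classify below are hypothetical helpers for illustration and are not part of HPC Pack; only the constant and the message text are taken from this issue.

```csharp
using System;
using System.ComponentModel;

// Hypothetical demo class; nothing here is HPC Pack code.
public static class CheckpointErrorDemo
{
    // The value the decompiled catch block compares NativeErrorCode against.
    public const int ToleratedCode = -2147024713;

    // Low 16 bits of an HRESULT carry the original error code.
    public static int Win32CodeFromHResult(int hresult)
    {
        return hresult & 0xFFFF;
    }

    // Facility 7 (FACILITY_WIN32) marks an HRESULT that wraps a Win32 error.
    public static bool IsWin32Facility(int hresult)
    {
        return ((hresult >> 16) & 0x1FFF) == 7;
    }

    // Mimics the shape of the decompiled try/catch: only ApplicationException
    // is inspected; every other exception type passes the handler untouched.
    public static string Classify(Exception thrown)
    {
        try
        {
            throw thrown;
        }
        catch (ApplicationException ex)
        {
            var inner = ex.InnerException as Win32Exception;
            return inner != null && inner.NativeErrorCode == ToleratedCode
                ? "swallowed"
                : "rethrown by handler";
        }
        catch (Exception)
        {
            return "bypasses handler";
        }
    }

    public static void Main()
    {
        Console.WriteLine("0x{0:X8}", ToleratedCode);               // 0x800700B7
        Console.WriteLine(Win32CodeFromHResult(ToleratedCode));     // 183
        Console.WriteLine(IsWin32Facility(ToleratedCode));          // True

        // Same exception type and message as in the hpctrace output.
        var direct = new Win32Exception(0, "Failed to set registry checkpoint on service (error: 0).");
        Console.WriteLine(Classify(direct));                        // bypasses handler

        // The only shape the decompiled handler would tolerate.
        var wrapped = new ApplicationException("wrapper", new Win32Exception(ToleratedCode, "already exists"));
        Console.WriteLine(Classify(wrapped));                       // swallowed
    }
}
```

If this reading is right, the handler was written to ignore an "already exists" result wrapped in an ApplicationException, and it cannot absorb the directly thrown Win32Exception with native error 0 that appears in the trace. That would explain why the exception reaches the unhandled-exception handler, although only the maintainers can confirm the intended behavior.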
