New install fails to start services due to Win32Exception in Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint
#52
Open
crowbar27 opened this issue
Oct 18, 2024
· 2 comments
Problem Description
I was upgrading an on-premises cluster from HPC Pack 2016 to 2019, but after a successful installation, I cannot connect to the HPC Cluster Manager, because the HPC services are constantly crashing.
Steps to Reproduce
I uninstalled all 2016 components on all three head nodes
I created a new remote SQL Server instance.
I populated the database with the script provided with the installer, which ran without error.
I ran the HPC Pack installer (first with built-in HA; later I retried with a single node, but the issue is always the same). The installer reported neither errors nor warnings in the prerequisite check and ran without error. All installer logs have an exit code of 0 at the end.
At the end, the installer asks whether I want to connect to the cluster manager. When I do so, the connection times out.
Expected Results
The connection succeeds.
Actual Results
The connection times out because the HPC services are constantly restarting. Most importantly, it seems that the scheduler cannot start, which causes all other services to fail, too. In the event log, I find entries like:
The HPC Diagnostics Service service terminated unexpectedly. It has done this 118 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
Note that the diagnostics service is just an example; there are similar entries for other services, including the scheduler.
Immediately before that, it logs event 1000 with details:
and before that I get event 1026 from the .NET Runtime:
Application: HpcDiagnostics.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ComponentModel.Win32Exception
at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(System.String, System.String)
at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+d__23.MoveNext()>
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+d__18.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.Diagnostics.Store.DiagnosticsStore+d__27.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.Diagnostics.DiagnosticsSvc+d__8.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.DiagnosticsWinService.DiagnosticsWinService+<b__2_1>d.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.MembershipDisabled+d__0.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
at Microsoft.Hpc.DiagnosticsWinService.DiagnosticsWinService+<>c__DisplayClass2_0+<b__0>d.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
Something seems to work, though, because I can see changes being made to the database. Most notably, for the built-in HA install, data was written to HPCHAWitness.
Additional Logs
Using hpctrace, I found that the scheduler is in a loop of:
17:45:04.933 i HpcScheduler 8428 5548 The HPC job scheduler started.
10/15/2024 17:45:05.120 e HpcScheduler 8428 5548 [[ServiceCore].StartSvc] .Exception detail: System.ComponentModel.Win32Exception (0x80004005): Failed to set registry checkpoint on service (error: 0).. at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(String serviceName, String RegPath).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d__29.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d__24.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode, Boolean schedulerOnAzure, String clusterName, String sqlString, String builtInAdmin, String builtInAdminPass, Func2 azureUserPasswordDecryptor, Func2 azureUserPasswordEncryptor).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d__19.MoveNext().Current stack: at Microsoft.Hpc.Scheduler.SchedulerTracingUtil.GenMessageFormat(String message, Object[] args, String e, String& newMessage, Object[]& newArgs).. at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Int32 jobId, Int32 taskId, Int32[] resourceId, String nodeName, Exception e, TraceEventType level, String message, Object[] args).. at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception e, String message, Object[] args).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d__19.MoveNext().. 
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[TStateMachine](TStateMachine& stateMachine).. at Microsoft.Hpc.Scheduler.SchedulerSvc.StartSvc(IHpcContext context).. at Microsoft.Hpc.Scheduler.SchedulerService.<b__5_1>d.MoveNext().. at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[TStateMachine](TStateMachine& stateMachine).. at Microsoft.Hpc.Scheduler.SchedulerService.b__5_1().. at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness.<>c__DisplayClass45_0.b__0(Object _).. at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem().. at System.Threading.ThreadPoolWorkQueue.Dispatch()..
10/15/2024 17:45:05.136 e HpcTrace 8428 5548 Current Application Domain UnhandledException event invoked: System.ComponentModel.Win32Exception (0x80004005): Failed to set registry checkpoint on service (error: 0).. at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(String serviceName, String RegPath).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d__29.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d__24.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode, Boolean schedulerOnAzure, String clusterName, String sqlString, String builtInAdmin, String builtInAdminPass, Func2 azureUserPasswordDecryptor, Func2 azureUserPasswordEncryptor).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d__19.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerService.<b__5_1>d.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. 
at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness.<>c__DisplayClass45_0.b__0(Object _).. at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem().. at System.Threading.ThreadPoolWorkQueue.Dispatch()
10/15/2024 17:45:05.152 i HpcTrace 8428 5548 Cosmos Logger is being closed
10/15/2024 17:46:38.471 w Microsoft.Hpc.HighAvailablity.Algorithm 12556 15320 [2024-10-15T17:46:38.3615810Z][Protocol][06018788-c18f-4aeb-a708-7d8e85d51f2e] Primary down
10/15/2024 17:46:40.549 i HpcScheduler.exe 12556 10052 [GetCertificateValidationCallback] Bypass certificate CN validation.
10/15/2024 17:46:40.549 i HpcScheduler.exe 12556 10052 [GetCertificateValidationCallback] Bypass certificate CN validation.
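To confirm that the scheduler is really looping on the same failure, the trace can be scanned for the checkpoint error and grouped by process. The following is a minimal sketch, assuming the trace has been exported to plain text in the layout shown above (date, time, level, source, PID, TID, message). The function name `count_checkpoint_failures` and the regular expression are illustrative only and are not part of HPC Pack.

```python
import re
from collections import Counter

# Assumed layout of one exported trace line, based on the excerpt above:
# date, time, level (i/w/e), source, PID, TID, message.
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[a-z])\s+"
    r"(?P<source>\S+)\s+"
    r"(?P<pid>\d+)\s+(?P<tid>\d+)\s+"
    r"(?P<message>.*)$"
)

# Message text taken verbatim from the Win32Exception in the trace.
CHECKPOINT_ERROR = "Failed to set registry checkpoint on service"


def count_checkpoint_failures(lines):
    """Count error-level lines with the checkpoint failure, keyed by (source, pid)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            # Lines that do not start with a full date/time header are skipped.
            continue
        if match["level"] == "e" and CHECKPOINT_ERROR in match["message"]:
            counts[(match["source"], int(match["pid"]))] += 1
    return counts


if __name__ == "__main__":
    # Shortened copies of the trace lines quoted above.
    sample = [
        "10/15/2024 17:45:05.120 e HpcScheduler 8428 5548 [[ServiceCore].StartSvc] "
        ".Exception detail: System.ComponentModel.Win32Exception (0x80004005): "
        "Failed to set registry checkpoint on service (error: 0)..",
        "10/15/2024 17:45:05.152 i HpcTrace 8428 5548 Cosmos Logger is being closed",
    ]
    for (source, pid), n in count_checkpoint_failures(sample).items():
        print(f"{source} (PID {pid}): {n} checkpoint failure(s)")
```

If the same error shows up under several different PIDs for the same source, each PID corresponds to one restart attempt that died on the same exception.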
Additional Comments
As the call stack contains some crypto-related frames, I first suspected an issue with the certificate, but it works neither with one from our AD-integrated CA nor with one created using the script provided with the installer. Furthermore, GetKeyAndSalt rather suggests an issue with a symmetric encryption algorithm, but I don't know of anything I can influence in this direction.
I was able to attach a remote debugger to the scheduler before it crashed and it seems that some obsolete code for FCM is running here:
[Obsolete("HAUtils is a utility class for failover cluster, do not use it anymore")]
[PermissionSet(SecurityAction.Demand, Name = "FullTrust")]
public class HAUtils
// ...
public unsafe static void SetGenericServiceRegistryCheckpoint(string serviceName, string RegPath)
{
    //IL_0014: Expected I8, but got I
    //IL_0026: Expected I8, but got I
    //IL_0048: Expected I, but got I8
    //IL_005c: Expected I, but got I8
    //IL_00d5: Expected I, but got I8
    //IL_0147: Expected I, but got I8
    //IL_0164: Expected I, but got I8
    //...
prepares some cryptographic key that is stored in the registry and, provided the head node is in an FCM cluster, creates this snapshot to make sure that all clustered nodes have the same registry data. However, this fails as HPC Pack does not use FCM anymore.