Opni Training Controller Service
Amartya Chakraborty edited this page Jan 25, 2023
This service processes requests from the AIOps gateway plugin and forwards requests to the GPU controller service and the CPU inferencing service.
- Python
- Determines whether a new Deep Learning model is necessary based on the watchlist.
- Manages the GPU and CPU inferencing services by redirecting logs to the appropriate service.
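As a sketch of the gatekeeping behavior described above (the class and method names here are hypothetical; the real service reacts to NATS messages, which are passed in directly here so the logic stays self-contained):

```python
class GPUState:
    """Hypothetical sketch of the controller's GPU gatekeeping logic."""

    def __init__(self) -> None:
        self.training_in_progress = False

    def handle_job_status(self, payload: bytes) -> None:
        # "JobStart" marks the GPU busy for training; "JobEnd" frees it.
        if payload == b"JobStart":
            self.training_in_progress = True
        elif payload == b"JobEnd":
            self.training_in_progress = False

    def inference_reply(self) -> bytes:
        # Reply sent for an inferencing request: "NO" while a model trains.
        return b"NO" if self.training_in_progress else b"YES"
```

After a "JobStart" payload arrives, the inferencing reply flips to "NO" until "JobEnd" is received.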
Component | Type | Description |
---|---|---|
gpu_trainingjob_status | NATS subject | When the training controller service receives the "JobStart" message on the gpu_trainingjob_status subject, it marks the GPU as unavailable for inferencing until it receives the "JobEnd" payload from the GPU controller service upon completion of training of the new Deep Learning model. |
gpu_service_inference | NATS request/reply subject | When new logs arrive to be inferenced on, the CPU inferencing service asks the training controller whether the GPU may be used. The training controller replies "YES" if the GPU is available for inferencing, or "NO" if a Deep Learning model is currently being trained. |
model_status | NATS request/reply subject | Replies with whether a model is currently being trained, has already been trained, or has never been trained. |
workload_parameters | NATS request/reply subject | Returns the workloads whose logs were used for the most recent training job of a Deep Learning model. |
train_model | NATS request/reply subject | Receives the workload payload from the Opni admin dashboard, specifying the workloads whose logs should be extracted to train a new Deep Learning model. |
train | NATS request/reply subject | If the training controller service determines that a new Deep Learning model is needed, it publishes to the train subject to begin training. |
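The request/reply subjects above can be modeled with a minimal in-memory stand-in for NATS (all names and status strings here are illustrative assumptions; the real service uses an actual NATS client):

```python
class FakeBus:
    """In-memory stand-in for NATS request/reply, for illustration only."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, subject, handler):
        self._handlers[subject] = handler

    def request(self, subject, payload=b""):
        # A real NATS request goes over the wire; here we call directly.
        return self._handlers[subject](payload)


def wire_controller(bus, state):
    """Register handlers for two of the subjects in the table above.

    `state` is a dict holding the controller's current view; the reply
    strings are assumptions, not the service's actual wire format.
    """
    bus.subscribe(
        "model_status",
        lambda _: state["model_status"].encode(),  # e.g. "not trained"
    )
    bus.subscribe(
        "gpu_service_inference",
        lambda _: b"NO" if state["training"] else b"YES",
    )
```

Because the handlers close over `state`, flipping `state["training"]` immediately changes the reply a requester sees.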
Component | Type | Description |
---|---|---|
train | NATS subject | When the training controller service publishes to the train subject, it sends the OpenSearch query that the GPU controller uses to fetch the logs for training. |
model_workload_parameters | NATS subject | When it is time to train a new model, the service publishes the latest workloads to the model_workload_parameters subject. |
gpu_training_job_status | NATS subject | When it is time to start training a new Deep Learning model, the service publishes the message "JobStart" in bytes form. |
gpu_service_training_internal | NATS subject | When the training controller service receives "JobStart" on the gpu_trainingjob_status subject, it marks the GPU as unavailable for inferencing until it receives the "JobEnd" payload from the GPU controller service upon completion of training of the new Deep Learning model. |
gpu_service_inference_internal | NATS subject | When new logs arrive to be inferenced on, the CPU inferencing service asks the training controller whether the GPU may be used. The training controller replies "YES" if the GPU is available for inferencing, or "NO" if a Deep Learning model is currently being trained. |
model_status | NATS request/reply subject | Replies with whether a model is currently being trained, has already been trained, or has never been trained. |
workload_parameters | NATS request/reply subject | Returns the workloads whose logs were used for the most recent training job of a Deep Learning model. |
train_model | NATS request/reply subject | Receives the workload payload from the Opni admin dashboard, specifying the workloads whose logs should be extracted to train a new Deep Learning model, and publishes a reply to the subject. |
train | NATS subject | If the training controller service determines that a new Deep Learning model is needed, it publishes to the train subject to begin training. |
gpu_service_running | NATS request/reply subject | The training controller service sends a request to the GPU controller over the gpu_service_running subject to check whether that service is currently up. |
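The publish side of the table above (latest workloads, then the "JobStart" marker, then the training query) can be sketched as follows; the function name and payload shapes are assumptions for illustration:

```python
import json


def start_training(publish, workloads, opensearch_query):
    """Sketch of the publishes the controller makes to kick off training.

    `publish` stands in for a NATS publish call taking a subject name
    and a bytes payload.
    """
    # Latest workloads whose logs will feed the new model.
    publish("model_workload_parameters", json.dumps(workloads).encode())
    # Signal that a training job is starting ("JobStart" in bytes form).
    publish("gpu_training_job_status", b"JobStart")
    # OpenSearch query the GPU controller uses to fetch training logs.
    publish("train", opensearch_query.encode())
```

Ordering matters here: the workloads and the "JobStart" marker go out before the train query, so consumers see the GPU as busy by the time log fetching begins.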
- The GPU status is currently held in a global variable, so it does not survive a restart of the training controller service.
- Unit tests
- Integration tests
- e2e tests
- Manual testing