
Opni Training Controller Service

Amartya Chakraborty edited this page Jan 25, 2023 · 4 revisions


Description

This service processes requests from the AIOps gateway plugin and forwards requests to the GPU controller service and the CPU inferencing service.

Programming Languages

  • Python

Diagram

[Diagram: Training Controller Service]

Responsibilities

  • Determine whether a new Deep Learning model is necessary based on the watchlist.
  • Manage the GPU and CPU inferencing services by routing logs to the appropriate service.
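The first responsibility can be sketched as a comparison between the current watchlist and the workload set used for the last training job. This is an illustrative sketch, not the actual Opni code: the function name and the set-comparison criterion are assumptions.

```python
# Hedged sketch of the watchlist decision: train a new model when no model
# exists yet, or when the watched workload set has changed since the last job.
from typing import Optional, Set

def needs_new_model(watchlist: Set[str], last_trained: Optional[Set[str]]) -> bool:
    """Return True when a new Deep Learning model should be trained."""
    if last_trained is None:          # no model has ever been trained
        return True
    return watchlist != last_trained  # the watched workload set changed
```

For example, `needs_new_model({"nginx", "etcd"}, {"nginx"})` returns `True` because a new workload has been added to the watchlist.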

Input and output interfaces

Input

| Component | Type | Description |
| --- | --- | --- |
| gpu_trainingjob_status | NATS subject | On receiving the "JobStart" payload, the training controller service marks the GPU as unavailable for inferencing until it receives the "JobEnd" payload from the GPU controller service once training of the new Deep Learning model completes. |
| gpu_service_inference | NATS request/reply subject | When new logs arrive to inference on, the CPU inferencing service asks whether the GPU can be used. The training controller service replies "YES" if the GPU is available or "NO" if a Deep Learning model is currently being trained. |
| model_status | NATS request/reply subject | Replies with whether a model is currently being trained, has already been trained, or has never been trained. |
| workload_parameters | NATS request/reply subject | Returns the workloads whose logs were used for the last Deep Learning model training job. |
| train_model | NATS request/reply subject | Receives from the Opni admin dashboard the payload listing the workloads whose logs should be extracted to train a new Deep Learning model. |
| train | NATS request/reply subject | If the training controller service deems a new Deep Learning model necessary, it publishes to the train subject to begin training. |
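The GPU-availability exchange in the first two rows can be sketched with the nats-py client. Only the subject names and the "YES"/"NO"/"JobStart"/"JobEnd" payloads come from the table above; the handler names, the connection URL, and the in-memory flag are assumptions, not the actual Opni source.

```python
import asyncio

# In-memory GPU state, toggled by gpu_trainingjob_status messages
# (see the Restrictions section: this state is lost on a restart).
gpu_training = False

def inference_reply(training_in_progress: bool) -> bytes:
    """Reply sent on gpu_service_inference: "NO" while a model trains."""
    return b"NO" if training_in_progress else b"YES"

async def serve():
    # Assumes the nats-py client (pip install nats-py) and a local server.
    import nats
    nc = await nats.connect("nats://localhost:4222")

    async def on_job_status(msg):
        global gpu_training
        # "JobStart" makes the GPU unavailable; "JobEnd" releases it.
        gpu_training = (msg.data == b"JobStart")

    async def on_inference_request(msg):
        await msg.respond(inference_reply(gpu_training))

    await nc.subscribe("gpu_trainingjob_status", cb=on_job_status)
    await nc.subscribe("gpu_service_inference", cb=on_inference_request)
    await asyncio.Event().wait()  # keep the subscriptions alive
```

Keeping `inference_reply` a pure function separates the YES/NO decision from the NATS wiring, which makes it easy to unit-test.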

Output

| Component | Type | Description |
| --- | --- | --- |
| train | NATS subject | Publishes the Opensearch query that the GPU controller uses to fetch the logs for training. |
| model_workload_parameters | NATS subject | When it is time to train a new model, publishes the latest workloads. |
| gpu_training_job_status | NATS subject | When training of a new Deep Learning model starts, publishes the message "JobStart" in bytes form. |
| gpu_service_training_internal | NATS subject | On receiving "JobStart" from the gpu_trainingjob_status subject, the training controller service marks the GPU as unavailable for inferencing until the "JobEnd" payload arrives from the GPU controller service once training completes. |
| gpu_service_inference_internal | NATS subject | When new logs arrive to inference on, the CPU inferencing service asks whether the GPU can be used. The training controller service replies "YES" if the GPU is available or "NO" if a Deep Learning model is currently being trained. |
| model_status | NATS request/reply subject | Publishes the reply stating whether a model is currently being trained, has already been trained, or has never been trained. |
| workload_parameters | NATS request/reply subject | Publishes the reply listing the workloads whose logs were used for the last Deep Learning model training job. |
| train_model | NATS request/reply subject | After receiving the workload payload from the Opni admin dashboard, publishes a reply to the subject. |
| train | NATS subject | If the training controller service deems a new Deep Learning model necessary, it publishes to the train subject to begin training. |
| gpu_service_running | NATS request/reply subject | Sends a request to the GPU controller to check whether the service is currently up and available. |
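The gpu_service_running probe in the last row can be sketched as below. The request-with-timeout pattern matches the nats-py client API; the timeout value and the "any exception means down" handling are assumptions. The connection object is passed in as a parameter so the helper stays testable without a live server.

```python
async def gpu_service_is_running(nc, timeout: float = 1.0) -> bool:
    """Probe the GPU controller over the gpu_service_running subject.

    nc is a connected NATS client (e.g. from nats-py's nats.connect()).
    A timeout or missing responder is treated as "service down".
    """
    try:
        await nc.request("gpu_service_running", b"", timeout=timeout)
        return True
    except Exception:  # e.g. nats.errors.TimeoutError, NoRespondersError
        return False
```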

Restrictions/limitations

  • The GPU status is currently held in a global variable, so it is lost whenever the training controller service is restarted.
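One possible way around this limitation is to persist the flag outside the process, for example in a NATS JetStream key-value bucket. This is a hedged sketch of that idea, not the current implementation; the key name is made up, and the store is passed in as any object with async `put`/`get` methods (such as a nats-py JetStream KeyValue handle).

```python
# Hypothetical mitigation: persist GPU training status in an external
# key-value store instead of a global variable, so a controller restart
# does not lose it.

async def save_gpu_status(kv, training: bool) -> None:
    # kv: any object with an async put(key, value) method.
    await kv.put("gpu_training", b"1" if training else b"0")

async def load_gpu_status(kv) -> bool:
    # Treat a missing key (e.g. on first start) as "not training".
    try:
        entry = await kv.get("gpu_training")
    except Exception:
        return False
    return entry.value == b"1"
```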

Performance issues

Test plan

  • Unit tests
  • Integration tests
  • e2e tests
  • Manual testing