
Opni Training Controller Service

Amartya Chakraborty edited this page Jan 25, 2023 · 4 revisions


Description

This service processes requests from the AIOps gateway plugin and forwards requests to the GPU controller service and the CPU inferencing service.

Programming Languages

  • Python

Diagram

[Diagram: Training Controller Service]

Responsibilities

  • Determine whether a new Deep Learning model is necessary based on the watchlist.
  • Manage the GPU and CPU inferencing services by routing logs to the appropriate service.
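The first responsibility can be sketched as a comparison between the current watchlist and the workload set used for the last training job. This is an illustrative sketch, not the actual Opni code: the function name and the set-comparison criterion are assumptions.

```python
# Hedged sketch of the watchlist decision: train a new model when no model
# exists yet, or when the watched workload set has changed since the last job.
from typing import Optional, Set

def needs_new_model(watchlist: Set[str], last_trained: Optional[Set[str]]) -> bool:
    """Return True when a new Deep Learning model should be trained."""
    if last_trained is None:          # no model has ever been trained
        return True
    return watchlist != last_trained  # the watched workload set changed
```

For example, `needs_new_model({"nginx", "etcd"}, {"nginx"})` returns `True` because a new workload has been added to the watchlist.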

Input and output interfaces

Input

| Component | Type | Description |
| --- | --- | --- |
| gpu_trainingjob_status | NATS subject | On receiving the "JobStart" payload, the training controller service marks the GPU as unavailable for inferencing until it receives the "JobEnd" payload from the GPU controller service once training of the new Deep Learning model completes. |
| gpu_service_inference | NATS request/reply subject | When new logs arrive to inference on, the CPU inferencing service asks whether the GPU can be used. The training controller service replies "YES" if the GPU is available or "NO" if a Deep Learning model is currently being trained. |
| model_status | NATS request/reply subject | Replies with whether a model is currently being trained, has already been trained, or has never been trained. |
| workload_parameters | NATS request/reply subject | Returns the workloads whose logs were used for the last Deep Learning model training job. |
| train_model | NATS request/reply subject | Receives from the Opni admin dashboard the payload listing the workloads whose logs should be extracted to train a new Deep Learning model. |
| train | NATS request/reply subject | If the training controller service deems a new Deep Learning model necessary, it publishes to the train subject to begin training. |
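The GPU-availability exchange in the first two rows can be sketched with the nats-py client. Only the subject names and the "YES"/"NO"/"JobStart"/"JobEnd" payloads come from the table above; the handler names, the connection URL, and the in-memory flag are assumptions, not the actual Opni source.

```python
import asyncio

# In-memory GPU state, toggled by gpu_trainingjob_status messages
# (see the Restrictions section: this state is lost on a restart).
gpu_training = False

def inference_reply(training_in_progress: bool) -> bytes:
    """Reply sent on gpu_service_inference: "NO" while a model trains."""
    return b"NO" if training_in_progress else b"YES"

async def serve():
    # Assumes the nats-py client (pip install nats-py) and a local server.
    import nats
    nc = await nats.connect("nats://localhost:4222")

    async def on_job_status(msg):
        global gpu_training
        # "JobStart" makes the GPU unavailable; "JobEnd" releases it.
        gpu_training = (msg.data == b"JobStart")

    async def on_inference_request(msg):
        await msg.respond(inference_reply(gpu_training))

    await nc.subscribe("gpu_trainingjob_status", cb=on_job_status)
    await nc.subscribe("gpu_service_inference", cb=on_inference_request)
    await asyncio.Event().wait()  # keep the subscriptions alive
```

Keeping `inference_reply` a pure function separates the YES/NO decision from the NATS wiring, which makes it easy to unit-test.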

Output

| Component | Type | Description |
| --- | --- | --- |
| train | NATS subject | Publishes the Opensearch query that the GPU controller uses to fetch the logs for training. |
| model_workload_parameters | NATS subject | When it is time to train a new model, publishes the latest workloads. |
| gpu_training_job_status | NATS subject | When training of a new Deep Learning model starts, publishes the message "JobStart" in bytes form. |
| gpu_service_training_internal | NATS subject | On receiving "JobStart" from the gpu_trainingjob_status subject, the training controller service marks the GPU as unavailable for inferencing until the "JobEnd" payload arrives from the GPU controller service once training completes. |
| gpu_service_inference_internal | NATS subject | When new logs arrive to inference on, the CPU inferencing service asks whether the GPU can be used. The training controller service replies "YES" if the GPU is available or "NO" if a Deep Learning model is currently being trained. |
| model_status | NATS request/reply subject | Publishes the reply stating whether a model is currently being trained, has already been trained, or has never been trained. |
| workload_parameters | NATS request/reply subject | Publishes the reply listing the workloads whose logs were used for the last Deep Learning model training job. |
| train_model | NATS request/reply subject | After receiving the workload payload from the Opni admin dashboard, publishes a reply to the subject. |
| train | NATS subject | If the training controller service deems a new Deep Learning model necessary, it publishes to the train subject to begin training. |
| gpu_service_running | NATS request/reply subject | Sends a request to the GPU controller to check whether the service is currently up and available. |
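The gpu_service_running probe in the last row can be sketched as below. The request-with-timeout pattern matches the nats-py client API; the timeout value and the "any exception means down" handling are assumptions. The connection object is passed in as a parameter so the helper stays testable without a live server.

```python
async def gpu_service_is_running(nc, timeout: float = 1.0) -> bool:
    """Probe the GPU controller over the gpu_service_running subject.

    nc is a connected NATS client (e.g. from nats-py's nats.connect()).
    A timeout or missing responder is treated as "service down".
    """
    try:
        await nc.request("gpu_service_running", b"", timeout=timeout)
        return True
    except Exception:  # e.g. nats.errors.TimeoutError, NoRespondersError
        return False
```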

Restrictions/limitations

  • The GPU status is currently held in a global variable, so it is lost whenever the training controller service is restarted.
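One possible way around this limitation is to persist the flag outside the process, for example in a NATS JetStream key-value bucket. This is a hedged sketch of that idea, not the current implementation; the key name is made up, and the store is passed in as any object with async `put`/`get` methods (such as a nats-py JetStream KeyValue handle).

```python
# Hypothetical mitigation: persist GPU training status in an external
# key-value store instead of a global variable, so a controller restart
# does not lose it.

async def save_gpu_status(kv, training: bool) -> None:
    # kv: any object with an async put(key, value) method.
    await kv.put("gpu_training", b"1" if training else b"0")

async def load_gpu_status(kv) -> bool:
    # Treat a missing key (e.g. on first start) as "not training".
    try:
        entry = await kv.get("gpu_training")
    except Exception:
        return False
    return entry.value == b"1"
```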

Performance issues

Test plan

  • Unit tests
  • Integration tests
  • e2e tests
  • Manual testing