
[Training] onnxruntime-training example has python import error #14637

Closed
CapJunkrat opened this issue Feb 9, 2023 · 8 comments
Assignees
Labels
ep:CUDA (issues related to the CUDA execution provider), training (issues related to ONNX Runtime training; typically submitted using template)

Comments

@CapJunkrat

Describe the issue

I'm trying the onnxruntime-training example from https://github.com/microsoft/onnxruntime-training-examples/blob/master/on_device_training/training_api_demo/mnist_training_example.ipynb
and got the following error:

File "/home/users/user/.local/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 128, in Module
def export_model_for_inferencing(self, inference_model_uri: str, graph_output_names: list[str]) -> None:
TypeError: 'type' object is not subscriptable

The error also occurs with the nightly builds. I tried the following CPU wheels and got the same result:
onnxruntime_training-1.15.0.dev20230201001+cpu-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
onnxruntime_training-1.15.0.dev20230207001+cpu-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

I'm on a CentOS 7 system with Python 3.8 and onnxruntime version 1.13.1.
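For anyone hitting this before a fix ships: `list[str]` is only subscriptable at runtime on Python 3.9+ (PEP 585), which is why the annotation raises `TypeError` on 3.8. A minimal sketch of the two usual workarounds, shown on a stub that mirrors the failing signature (the body here is not the real implementation):

```python
from typing import List

# Fix 1: use typing.List, which is subscriptable on Python 3.8.
def export_model_for_inferencing(inference_model_uri: str,
                                 graph_output_names: List[str]) -> None:
    """Stub mirroring the signature that fails in module.py."""
    pass

# Fix 2 (alternative): put `from __future__ import annotations` at the very
# top of the module (PEP 563). Annotations are then stored as strings and
# never evaluated at definition time, so `list[str]` stops raising on 3.8.
```

Either change would need to land in onnxruntime's own `module.py`; the snippet above just demonstrates the mechanism.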

To reproduce

I'm on a CentOS 7 system with Python 3.8 and onnxruntime version 1.13.1, running the example from https://github.com/microsoft/onnxruntime-training-examples/blob/master/on_device_training/training_api_demo/mnist_training_example.ipynb

Urgency

No response

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.13.1

PyTorch Version

1.11.0

Execution Provider

Default CPU, CUDA

Execution Provider Library Version

Cuda 11.6 and CPU versions

@CapJunkrat CapJunkrat added the training issues related to ONNX Runtime training; typically submitted using template label Feb 9, 2023
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Feb 9, 2023
@baijumeswani
Contributor

@CapJunkrat apologies. This was tested against Python 3.9; we didn't test with Python 3.8.

I will fix it in a PR. Thanks for reporting this.

@CapJunkrat
Author

Glad to hear. Thank you!

@CapJunkrat
Author

CapJunkrat commented Feb 10, 2023

There is actually one more issue associated with this example.
In the train function, model.reset_grad() is used, but Module no longer has reset_grad(); it seems to have been replaced by lazy_reset_grad. Should the example use lazy_reset_grad() instead?

@jingyanwangms
Contributor

jingyanwangms commented Feb 21, 2023

There is actually one more issue associated with this example. In the train function, model.reset_grad() is used, but Module no longer has reset_grad(); it seems to have been replaced by lazy_reset_grad. Should the example use lazy_reset_grad() instead?

Looks like reset_grad has been replaced by lazy_reset_grad. We should update the example to use the new API. Did you give lazy_reset_grad a try?
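The change in the notebook's train function amounts to swapping one call. The sketch below stubs out the Module object so it runs standalone; the stub class and its behavior are assumptions for illustration, not the real onnxruntime.training API:

```python
class _StubModule:
    """Stand-in for onnxruntime.training.api.Module, only so the
    loop below is runnable here."""

    def __init__(self):
        self.reset_calls = 0

    def __call__(self, *args):
        return 0.5  # pretend loss

    def lazy_reset_grad(self):
        # The renamed method discussed in this thread.
        self.reset_calls += 1


model = _StubModule()
for batch in range(3):
    loss = model(batch)
    # model.reset_grad()     # old call: no longer exists on Module
    model.lazy_reset_grad()  # new call per this thread
```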

@CapJunkrat
Author

There is actually one more issue associated with this example. In the train function, model.reset_grad() is used, but Module no longer has reset_grad(); it seems to have been replaced by lazy_reset_grad. Should the example use lazy_reset_grad() instead?

Looks like reset_grad has been replaced by lazy_reset_grad. We should update the example to use the new API. Did you give lazy_reset_grad a try?

Yes, the code worked and the result seems correct too. May I ask what the difference between these two APIs is?

@baijumeswani
Contributor

baijumeswani commented Feb 22, 2023

Sorry, I didn't answer the question in time.

The training API does not reset the gradients when lazy_reset_grad is called. Instead it sets an internal flag, so that at the time of the next graph execution (when train_step is called), the gradients are zeroed out before being written to.
In essence, the gradients are reset lazily (not greedily). This is what prompted us to change the name of the function.

lazy_reset_grad lets us avoid iterating over all the gradients and resetting every element to 0. Since those elements will be recomputed at the next train_step anyway, the reset can happen at that point.
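To make that concrete, here is a toy sketch of the lazy-flag pattern described above. This is purely illustrative Python, not the real onnxruntime implementation; the class, fields, and gradient representation are all made up:

```python
class ToyModule:
    """Toy model state: a flat gradient buffer plus a lazy-reset flag."""

    def __init__(self):
        self.grads = [0.0, 0.0]
        self._reset_pending = False

    def lazy_reset_grad(self):
        # O(1): no pass over the gradient buffer happens here.
        self._reset_pending = True

    def train_step(self, new_grads):
        if self._reset_pending:
            # The zeroing is folded into the next write: overwrite
            # instead of accumulate, then clear the flag.
            self.grads = list(new_grads)
            self._reset_pending = False
        else:
            # Normal path: accumulate gradients across steps.
            self.grads = [g + n for g, n in zip(self.grads, new_grads)]
        return self.grads
```

So a greedy reset would walk the whole buffer immediately, while the lazy version defers the work to a write that train_step has to do anyway.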

Hope that helps.

@baijumeswani
Contributor

Having said that, there was another idea to expose a function that would greedily reset the gradient (should the user ever need it). This has not been implemented yet.

@CapJunkrat
Author

Got it! Thank you very much for the explanation @baijumeswani. I believe my problem here is solved.


4 participants