
[Training] onnxruntime-training example has python import error #14637

Closed
CapJunkrat opened this issue Feb 9, 2023 · 8 comments
Assignees
Labels
ep:CUDA (issues related to the CUDA execution provider), training (issues related to ONNX Runtime training; typically submitted using template)

Comments

@CapJunkrat

Describe the issue

I'm trying the onnxruntime-training example from https://github.com/microsoft/onnxruntime-training-examples/blob/master/on_device_training/training_api_demo/mnist_training_example.ipynb
and got the following error:

File "/home/users/user/.local/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 128, in Module
def export_model_for_inferencing(self, inference_model_uri: str, graph_output_names: list[str]) -> None:
TypeError: 'type' object is not subscriptable

The error also occurs with the nightly builds. I tried the following CPU wheels and got the same result:
onnxruntime_training-1.15.0.dev20230201001+cpu-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
onnxruntime_training-1.15.0.dev20230207001+cpu-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

I'm on a CentOS 7 system with Python 3.8 and onnxruntime version 1.13.1.
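For anyone hitting this before a fix ships: `list[str]` is only subscriptable at runtime on Python 3.9+ (PEP 585), which is why the annotation raises `TypeError` on 3.8. A minimal sketch of the two usual workarounds, shown on a stub that mirrors the failing signature (the body here is not the real implementation):

```python
from typing import List

# Fix 1: use typing.List, which is subscriptable on Python 3.8.
def export_model_for_inferencing(inference_model_uri: str,
                                 graph_output_names: List[str]) -> None:
    """Stub mirroring the signature that fails in module.py."""
    pass

# Fix 2 (alternative): put `from __future__ import annotations` at the very
# top of the module (PEP 563). Annotations are then stored as strings and
# never evaluated at definition time, so `list[str]` stops raising on 3.8.
```

Either change would need to land in onnxruntime's own `module.py`; the snippet above just demonstrates the mechanism.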

To reproduce

I'm on a CentOS 7 system with Python 3.8 and onnxruntime version 1.13.1, running the example from https://github.com/microsoft/onnxruntime-training-examples/blob/master/on_device_training/training_api_demo/mnist_training_example.ipynb

Urgency

No response

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.13.1

PyTorch Version

1.11.0

Execution Provider

Default CPU, CUDA

Execution Provider Library Version

Cuda 11.6 and CPU versions

@CapJunkrat CapJunkrat added the training issues related to ONNX Runtime training; typically submitted using template label Feb 9, 2023
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Feb 9, 2023
@baijumeswani
Contributor

@CapJunkrat apologies. This was tested against Python 3.9; we didn't test with Python 3.8.

I will fix it in a PR. Thanks for reporting this.

@CapJunkrat
Author

Glad to hear. Thank you!

@CapJunkrat
Author

CapJunkrat commented Feb 10, 2023

There is actually one more issue associated with this example.
In the train function, model.reset_grad() is used, but Module no longer has reset_grad(); it seems to have been replaced by lazy_reset_grad. Should the example use lazy_reset_grad() instead?

@jingyanwangms
Contributor

jingyanwangms commented Feb 21, 2023

There is actually one more issue associated with this example. In the train function, model.reset_grad() is used, but Module no longer has reset_grad(); it seems to have been replaced by lazy_reset_grad. Should the example use lazy_reset_grad() instead?

Looks like reset_grad has been replaced by lazy_reset_grad. We should update the example to use the new API. Did you give lazy_reset_grad a try?
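The change in the notebook's train function amounts to swapping one call. The sketch below stubs out the Module object so it runs standalone; the stub class and its behavior are assumptions for illustration, not the real onnxruntime.training API:

```python
class _StubModule:
    """Stand-in for onnxruntime.training.api.Module, only so the
    loop below is runnable here."""

    def __init__(self):
        self.reset_calls = 0

    def __call__(self, *args):
        return 0.5  # pretend loss

    def lazy_reset_grad(self):
        # The renamed method discussed in this thread.
        self.reset_calls += 1


model = _StubModule()
for batch in range(3):
    loss = model(batch)
    # model.reset_grad()     # old call: no longer exists on Module
    model.lazy_reset_grad()  # new call per this thread
```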

@CapJunkrat
Author

There is actually one more issue associated with this example. In the train function, model.reset_grad() is used, but Module no longer has reset_grad(); it seems to have been replaced by lazy_reset_grad. Should the example use lazy_reset_grad() instead?

Looks like reset_grad has been replaced by lazy_reset_grad. We should update the example to use the new API. Did you give lazy_reset_grad a try?

Yes, the code worked and the result seems correct too. May I ask what the difference between these two APIs is?

@baijumeswani
Contributor

baijumeswani commented Feb 22, 2023

Sorry, I didn't answer the question in time.

The training API does not reset the gradients when lazy_reset_grad is called. Instead it sets an internal flag, so that at the time of the next graph execution (when train_step is called), the gradients are zeroed out before being written to.
In essence, the gradients are reset lazily (not greedily). This is what prompted us to change the name of the function.

lazy_reset_grad lets us avoid iterating over all the gradients and resetting every element to 0. Since those elements will be recomputed at the next train_step anyway, the reset can happen at that point.
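To make that concrete, here is a toy sketch of the lazy-flag pattern described above. This is purely illustrative Python, not the real onnxruntime implementation; the class, fields, and gradient representation are all made up:

```python
class ToyModule:
    """Toy model state: a flat gradient buffer plus a lazy-reset flag."""

    def __init__(self):
        self.grads = [0.0, 0.0]
        self._reset_pending = False

    def lazy_reset_grad(self):
        # O(1): no pass over the gradient buffer happens here.
        self._reset_pending = True

    def train_step(self, new_grads):
        if self._reset_pending:
            # The zeroing is folded into the next write: overwrite
            # instead of accumulate, then clear the flag.
            self.grads = list(new_grads)
            self._reset_pending = False
        else:
            # Normal path: accumulate gradients across steps.
            self.grads = [g + n for g, n in zip(self.grads, new_grads)]
        return self.grads
```

So a greedy reset would walk the whole buffer immediately, while the lazy version defers the work to a write that train_step has to do anyway.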

Hope that helps.

@baijumeswani
Contributor

Having said that, there was another idea to expose a function that would greedily reset the gradient (should the user ever need it). This has not been implemented yet.

@CapJunkrat
Author

Got it! Thank you very much for the explanation @baijumeswani. I believe my problem here is solved.


4 participants