"make train" stuck at Training #140

Open
ravidborse opened this issue Aug 28, 2023 · 1 comment
Open

"make train" stuck at Training #140

ravidborse opened this issue Aug 28, 2023 · 1 comment


@ravidborse

It has been stuck here for the last hour:

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 402.99it/s]
08/28/2023 18:58:48 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /transformers_cache/spider/spider/1.0.0/df8615a31625b12f701e3840f2502d74f4b533dc60aa364a1f48cfd198acc326/cache-7e03875afb379451.arrow
08/28/2023 18:58:48 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /transformers_cache/spider/spider/1.0.0/df8615a31625b12f701e3840f2502d74f4b533dc60aa364a1f48cfd198acc326/cache-06decf315ea7a716.arrow
08/28/2023 18:58:49 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /transformers_cache/spider/spider/1.0.0/df8615a31625b12f701e3840f2502d74f4b533dc60aa364a1f48cfd198acc326/cache-6ef067fed50d786a.arrow
08/28/2023 18:58:49 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /transformers_cache/spider/spider/1.0.0/df8615a31625b12f701e3840f2502d74f4b533dc60aa364a1f48cfd198acc326/cache-e3414ffb7b73b322.arrow
08/28/2023 18:58:51 - WARNING - seq2seq.utils.dataset_loader - The split train of the dataset spider contains 8 duplicates out of 7000 examples
***** Running training *****
Num examples = 7000
Num Epochs = 3072
Instantaneous batch size per device = 5
Total train batch size (w. parallel, distributed & accumulation) = 2050
Gradient Accumulation steps = 410
Total optimization steps = 9216
0%| | 0/9216 [00:00<?, ?it/s]
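A possible reading of the numbers above (my own back-of-envelope check, not the trainer's actual code): with a per-device batch of 5 and 410 gradient-accumulation steps, one optimization step covers 2050 examples and 410 forward/backward passes, so the 0/9216 progress bar only advances after all 410 micro-batches of a step finish. On a slow device that first tick can take a long time and look like a hang.

```python
# Back-of-envelope check of how the logged numbers relate
# (a sketch, not the trainer's code; single device is an assumption).
per_device_batch = 5
grad_accum_steps = 410
num_devices = 1                                             # assumption
effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = 7000 // effective_batch                   # 7000 train examples
total_steps = steps_per_epoch * 3072                        # 3072 epochs
print(effective_batch, steps_per_epoch, total_steps)        # 2050 3 9216
```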

@ravidborse
Author

Actually, it's stuck in torch.autograd.backward:

***** Running training *****
Num examples = 7000
Num Epochs = 3072
Instantaneous batch size per device = 5
Total train batch size (w. parallel, distributed & accumulation) = 2050
Gradient Accumulation steps = 410
Total optimization steps = 9216
0%| | 0/9216 [00:00<?, ?it/s]^CTraceback (most recent call last):
File "seq2seq/run_seq2seq.py", line 271, in
main()
File "seq2seq/run_seq2seq.py", line 216, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1400, in train
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 2002, in training_step
loss.backward()
File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/init.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
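To confirm where a run like this is spending its time without pressing Ctrl-C, Python's built-in faulthandler can dump every thread's stack on demand. This is only a generic debugging sketch (the signal choice and placement are my assumptions, not part of this repo); it would go near the top of seq2seq/run_seq2seq.py:

```python
import faulthandler
import signal

# Dump all thread stacks when the process receives SIGUSR1
# (run `kill -USR1 <pid>` from another shell; Unix only).
faulthandler.register(signal.SIGUSR1)

# Alternatively, print the stacks every 10 minutes while investigating the hang.
faulthandler.dump_traceback_later(timeout=600, repeat=True)
```

The Ctrl-C traceback above already shows the process inside loss.backward(), so repeated dumps would mainly reveal whether that frame ever changes or the step is simply slow.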
