Training errors - nans #19
Comments
Could you provide a minimal reproducible example for this error?
I am digging in right now, and it looks like the error comes from the function
loc = reduce(
    id_mask * reduce(target * observed_mask, "... seq dim -> ... 1 seq", "sum"),
    "... seq1 seq2 -> ... seq1 1",
    "sum",
)
None of the elements of this reduce contains NaNs:
but
actually is the right inner
Edit: More details... I discovered that some values of my target are loaded as
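One way a masked sum like the one above can produce NaN even though no input is NaN is an infinite entry in the target: multiplying it by a zero mask does not remove it, because inf * 0 is NaN under IEEE 754. The sketch below uses plain Python floats and hypothetical values. Whether this is what happened here is an assumption, since the thread does not show what the values were loaded as.

```python
import math

def masked_sum(values, mask):
    """Sum values after multiplying by a 0/1 mask (mirrors target * observed_mask)."""
    return sum(v * m for v, m in zip(values, mask))

def safe_masked_sum(values, mask):
    """Select observed values instead of multiplying, so masked entries never enter the sum."""
    return sum(v for v, m in zip(values, mask) if m)

target = [1.0, float("inf"), 3.0]   # hypothetical data with one bad entry
observed_mask = [1.0, 0.0, 1.0]     # the bad entry is masked out

naive = masked_sum(target, observed_mask)      # inf * 0.0 poisons the sum
safe = safe_masked_sum(target, observed_mask)  # only observed entries are added
```

With tensors, the analogous idea would be to select with the mask (for example `torch.where(observed_mask, target, torch.zeros_like(target))`, assuming a boolean mask) rather than multiply by it.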
Sorry for my previous message @gorold, I should have dug a bit more before writing here. 😢
It appears randomly after some steps (it can even be after one epoch). Do you have an idea?
Hey @Abelarm, this should be caused by a backward pass performing a gradient update with a NaN value. It could be caused by an inf loss perhaps... One way to debug this is to detect any bad values in the training_step and save all inputs and weights.
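A minimal sketch of that kind of check, written in plain Python so it runs without any framework: it scans named collections of numbers (standing in for batch inputs and model weights) and raises with the names of those that hold NaN or inf. The function names and the data layout are hypothetical; in a PyTorch training_step the same test would typically be `torch.isfinite(t).all()` on each tensor.

```python
import math

def find_non_finite(named_values):
    """Return {name: count} for every entry that contains NaN or +/-inf."""
    bad = {}
    for name, values in named_values.items():
        count = sum(1 for v in values if not math.isfinite(v))
        if count:
            bad[name] = count
    return bad

def assert_all_finite(step, named_values):
    """Raise before the update so the offending batch and weights can be saved."""
    bad = find_non_finite(named_values)
    if bad:
        raise FloatingPointError(f"step {step}: non-finite values in {bad}")

# Hypothetical snapshot of one training step.
snapshot = {
    "target": [0.5, 1.2, float("nan")],
    "observed_mask": [1.0, 1.0, 0.0],
    "weights/linear": [0.1, -0.3, float("inf"), float("nan")],
}
report = find_non_finite(snapshot)
```

Calling `assert_all_finite(step, snapshot)` at the start of each step stops training at the first bad batch, at which point the inputs and weights can be written to disk for inspection.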
You could try adding the
When I ran it with anomaly detection, I got:
Keep in mind that I am on macOS with ARM, and I needed to change a couple of
I think running it on CPU would result in a more helpful error message. But I can't really help any further without being able to reproduce this error.
Just a heads-up on this: all the problems come from this:
Then I tried to fix it by changing line 46 of the file
So I don't know if you had support for native macOS in mind, but for now it looks like it is not possible to train on Apple hardware.
Thanks for the resolution! You could remove the
I don't think we'll support MPS for now.
I have also gotten NaN'd weights when doing pretraining on A100s and a 3090 Ti when using the lotsa or gluonts datasets. The distribution projections handle only matching patch sizes -- I think we're getting unlucky sometimes where there are a few good samples for patch size x, a few good ones for patch size y, and then 1 sample for patch size y made of all 0s where basically all actual data got masked out. I haven't narrowed it down yet, but I think something like this might be happening. All 0s leads to 0 variance, and some of the distribution code divides by variance. I think it gets clamped to the dtype's epsilon, but that might result in huge magnitudes that turn into infs and NaNs somewhere. They happen pretty rarely, so it's hard to repro reliably.
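To make the suspected mechanism concrete, here is a rough sketch using float32 limits written out as constants. The sample value and the fourth-power step are hypothetical stand-ins for whatever the distribution code actually computes. The point is only that dividing by a scale clamped to epsilon leaves little headroom before float32 overflows, and that inf minus inf is NaN.

```python
import math

FLOAT32_EPS = 2.0 ** -23                        # machine epsilon of float32
FLOAT32_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127   # largest finite float32

def clamped_scale(std, eps=FLOAT32_EPS):
    """Clamp a standard deviation from below, as a guard against division by zero."""
    return max(std, eps)

def overflows_float32(value):
    """True if the magnitude cannot be represented as a finite float32."""
    return abs(value) > FLOAT32_MAX

std = 0.0                          # all-zero observed sample: zero variance
x = 1000.0                         # hypothetical non-zero value normalized by that scale
z = x / clamped_scale(std)         # about 8.4e9, still finite in float32
z_squared = z * z                  # about 7.0e19, still finite
z_fourth = z_squared * z_squared   # about 5.0e39, beyond FLOAT32_MAX

# Once two terms have overflowed to inf, subtracting them gives NaN.
nan_from_inf = float("inf") - float("inf")
```

Under these assumptions the normalized value itself is finite, so a check on the inputs alone would not flag it; the overflow only shows up a few operations later.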
Hey @fmmoret, thanks for reporting this, we've also seen this occasionally. Another possible cause could be the attention layer, if all tokens are masked. Re-opening this issue to track this pre-training issue.
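The all-masked case is easy to reproduce in isolation. The sketch below is a plain-Python softmax over one row of attention scores, with masked positions set to -inf. The function name and values are illustrative and not taken from the repository. When every position is masked, the whole row becomes NaN, which matches what a standard softmax does for a row of all -inf.

```python
import math

def masked_softmax(scores, mask):
    """Numerically stable softmax over one row; positions with mask False get -inf."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    row_max = max(masked)
    # If every entry is -inf, row_max is -inf and (-inf) - (-inf) is NaN.
    exps = [math.exp(v - row_max) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]

# One masked token: a valid distribution with zero weight on the masked position.
partial = masked_softmax(scores, [True, True, False])

# Every token masked: all attention weights are NaN.
all_masked = masked_softmax(scores, [False, False, False])
```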
For this case, a quick fix is to remove the outlier samples by adding one line in
Hi,
First of all, thanks for the great work!
Everything works almost plug-and-play for inference.
I am having trouble while fine-tuning. I usually get one of two errors:
logits: torch.Size([16, 512, 128])) to satisfy the constraint GreaterThan(lower_bound=0.0)
or
Categorical(logits: torch.Size([16, 512, 128, 4])) to satisfy the constraint IndependentConstraint(Real(), 1)
In this case it is because there are some nulls in the tensor. Do you have any idea how to solve it? Or is it because there is something bad in my data?