Training errors - nans #19

Open
Abelarm opened this issue Apr 11, 2024 · 14 comments
Labels
bug (Something isn't working), good first issue (Good for newcomers)

Comments

@Abelarm

Abelarm commented Apr 11, 2024

Hi,

First of all, thanks for the great work!
Everything works almost plug-and-play for inference.

I am having trouble while fine-tuning. I usually get one of two errors:

logits: torch.Size([16, 512, 128])) to satisfy the constraint GreaterThan(lower_bound=0.0)

or

Categorical(logits: torch.Size([16, 512, 128, 4])) to satisfy the constraint IndependentConstraint(Real(), 1)

The second one occurs because there are some null (NaN) values in the tensor.

Do you have any idea how to solve it, or is it because there is something wrong with my data?

@Abelarm Abelarm changed the title Training errors during finetuning Training errors during finetuning - tensor errors Apr 11, 2024
@gorold
Contributor

gorold commented Apr 11, 2024

Could you provide a minimal reproducible example for this error?

@Abelarm
Author

Abelarm commented Apr 11, 2024

I am digging in right now, and it looks like the error comes from the function _get_loc_scale at:

loc = reduce(
    id_mask * reduce(target * observed_mask, "... seq dim -> ... 1 seq", "sum"),
    "... seq1 seq2 -> ... seq1 1",
    "sum",
)

None of the inputs to this reduce contain NaNs:

torch.isnan(id_mask).any()
tensor(False, device='mps:0')
torch.isnan(target).any()
tensor(False, device='mps:0')
torch.isnan(observed_mask).any()
tensor(False, device='mps:0')

but loc does:

torch.isnan(loc).any()
tensor(True, device='mps:0')

Actually, it is the inner reduce (the one on the right) that produces the NaNs:

torch.isnan(reduce(target * observed_mask, "... seq dim -> ... 1 seq", "sum")).any()
tensor(True, device='mps:0')

Edit: More details...

I discovered that some values of my target are loaded as inf for some strange reason...

target[4,62,15]
tensor(inf, device='mps:0')
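
A note on why inputs that pass the isnan check can still produce NaN in the reduce: under IEEE 754 arithmetic, inf * 0 is NaN, so multiplying by a 0/1 mask does not neutralise an inf, and inf + (-inf) is NaN as well. The sketch below uses plain Python floats with made-up values (not the actual tensors from this run) and contrasts masking by multiplication with masking by selection.

```python
import math

# Hypothetical values standing in for one row of `target` and `observed_mask`.
target = [1.0, float("inf"), 3.0]
observed_mask = [1.0, 0.0, 1.0]  # the inf entry is marked as unobserved

# Masking by multiplication: inf * 0.0 is NaN, and NaN poisons the sum.
masked_by_multiply = sum(t * m for t, m in zip(target, observed_mask))

# Masking by selection: unobserved entries never enter the arithmetic.
masked_by_select = sum(t if m else 0.0 for t, m in zip(target, observed_mask))

print(math.isnan(masked_by_multiply))  # True
print(masked_by_select)                # 4.0
```

In PyTorch the selection variant would typically be written with torch.where, and a torch.isfinite check on the raw data (rather than torch.isnan alone) would catch the inf at the source.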

@Abelarm
Author

Abelarm commented Apr 12, 2024

Sorry for my previous message @gorold, I should have dug a bit more before writing here. 😢
That problem is solved, but now I am stuck with:

distribution Categorical(logits: torch.Size([32, 512, 128, 4])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[[[nan, nan, nan, nan],
          [nan, nan, nan, nan],
          [nan, nan, nan, nan],
          ...,

It appears randomly after some steps (sometimes even after one full epoch).
My debugging traced the root of the problem to the MultiInSizeLinear module: at some point self.weight becomes all NaN, so the forward pass of this module returns NaN.
The strange thing is that training can sometimes run for more than 100 steps before breaking.

Do you have an idea?

@gorold
Contributor

gorold commented Apr 12, 2024

Hey @Abelarm, this should be caused by a backward pass performing a gradient update with a NaN value. It could be caused by an inf loss, perhaps... One way to debug this is to detect any bad values in the training_step and save all inputs and weights.
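
A minimal sketch of that kind of check, assuming the batch is a dict of tensors. The helper names find_non_finite and check_step, the dump path, and the stand-in nn.Linear model are all illustrative and not part of the repository.

```python
import os
import tempfile

import torch
from torch import nn


def find_non_finite(named_tensors):
    """Return the names of floating-point tensors containing NaN or +/-inf."""
    return [
        name
        for name, tensor in named_tensors
        if isinstance(tensor, torch.Tensor)
        and tensor.is_floating_point()
        and not torch.isfinite(tensor).all()
    ]


def check_step(model, batch, loss, dump_path):
    """Check inputs, loss and weights; save them to dump_path if any are bad."""
    bad = find_non_finite(batch.items())
    bad += find_non_finite([("loss", loss.detach())])
    bad += find_non_finite(model.named_parameters())
    if bad:
        torch.save({"batch": batch, "state_dict": model.state_dict()}, dump_path)
    return bad


# Tiny demonstration: a stand-in model and a batch that contains an inf.
model = nn.Linear(4, 1)
batch = {"target": torch.tensor([[1.0, 2.0, float("inf"), 4.0]])}
loss = model(batch["target"]).sum()
dump_path = os.path.join(tempfile.gettempdir(), "bad_step.pt")
bad = check_step(model, batch, loss, dump_path)
print(bad)
```

Called inside training_step just before the loss is returned, a check like this stops at the first step where something goes non-finite and leaves a file that can be loaded offline with torch.load.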

@Abelarm
Author

Abelarm commented Apr 12, 2024

I think we can rule out an inf loss: I started logging the loss after each step, and the losses for the steps just before the weights become NaN are the following:
image

@Abelarm
Author

Abelarm commented Apr 12, 2024

It is pretty strange. For quick debugging I just printed the sum of the weights for each feature of the MultiInSizeLinear module:
image
At step 53 we get a "normal" sum with a "normal" loss,
but at the very next step everything is NaN 😭

@gorold
Contributor

gorold commented Apr 12, 2024

You could try adding the +trainer.detect_anomaly=True flag; the stack trace might be helpful.
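
For context, a small standalone illustration of what anomaly detection reports, using torch.autograd.detect_anomaly directly (as far as I know, the trainer flag enables the same PyTorch mechanism). The toy function is an assumption chosen so that the forward value is finite while the gradient is NaN, which also shows that a normal-looking loss does not rule out a NaN update.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)

# The forward value is a perfectly finite 0.0 ...
y = torch.sqrt(x) * 0.0

# ... but the backward pass computes 0 / (2 * sqrt(0)) = 0 / 0 = NaN.
y.backward()
first_grad = x.grad.clone()
print(first_grad)  # tensor(nan)

# With anomaly detection enabled, the same backward raises instead of
# silently writing NaN, and the error names the offending backward function.
x.grad = None
caught = None
try:
    with torch.autograd.detect_anomaly():
        y = torch.sqrt(x) * 0.0
        y.backward()
except RuntimeError as err:
    caught = err
    print("caught:", err)
```

Anomaly detection slows training down noticeably, so it is usually switched on only while hunting for the failing op.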

@Abelarm
Author

Abelarm commented Apr 12, 2024

When I ran with detect_anomaly I got:
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function getKernelAxes, file GPUReductionOps.mm, line 31.
Could this be the reason?

Keep in mind that I am on macOS on ARM, and I needed to change a couple of double to long because MPS does not support double.

@gorold
Contributor

gorold commented Apr 12, 2024

I think running it on cpu would result in a more helpful error message. But I can't really help any further without being able to reproduce this error.

@Abelarm
Author

Abelarm commented Apr 15, 2024

Just a heads-up on this:

All the problems come from this: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

I then tried to fix it by changing line 46 of packed_scaler.py from target.double() to target.float(), and everything went sideways.

So I don't know if you had support for native macOS in mind, but for now it looks like it is not possible to train on Apple hardware.

@gorold
Contributor

gorold commented Apr 16, 2024

Thanks for the resolution! You could remove the .double() call locally if you need MPS, it's only required to handle time series with very large values.
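
To illustrate the "very large values" point: float32 overflows near 3.4e38, so any statistic that squares the values (as a variance computation would; this is an assumption about what the scaler does, not a quote of its code) can overflow in float32 long before it does in float64. The numbers below are made up for the demonstration.

```python
import torch

# Hypothetical series with very large values (not from any dataset in this thread).
target = torch.tensor([1e20, 2e20, 3e20])

# float32 tops out near 3.4e38, so squaring 1e20 already overflows to inf.
sum_sq_f32 = (target.float() ** 2).sum()

# float64 has room up to roughly 1.8e308, so the same sum stays finite.
sum_sq_f64 = (target.double() ** 2).sum()

print(torch.isinf(sum_sq_f32).item())     # True
print(torch.isfinite(sum_sq_f64).item())  # True
```

If the series being trained on stay well below that range, dropping the .double() call should not change the statistics in any way that matters.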

I don't think we'll support MPS for now.

@gorold gorold closed this as completed Apr 16, 2024
@fmmoret

fmmoret commented Apr 24, 2024

I have also gotten nan'd weights when doing pretraining on A100s and a 3090 ti when using the lotsa or gluonts datasets.
Many of the datasets are very sparse / nan heavy
0 0 0 0 0 ... actual data.

The distributions projections handle only matching patch sizes -- I think we're getting unlucky sometimes where there are a few good samples for patch size x, few good ones for patch size y, and then 1 sample for patch size y made of all 0s where basically all actual data got masked out.

I haven't narrowed it down yet, but I think something like this might be happening. All 0s leads to 0 variance and some of the distribution code divides by variance. I think it gets clamped to dtype's epsilon, but that might result in huge magnitudes that turn into infs & nan's somewhere.

They happen pretty rarely, so it's hard to repro reliably
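
A toy illustration of the hypothesis above, not code from the repository: an all-zero patch has zero standard deviation, clamping it to the float32 epsilon gives a tiny divisor, and anything divided by it becomes large enough to overflow in later operations. The exp and the subtraction merely stand in for "somewhere" downstream.

```python
import torch

eps = torch.finfo(torch.float32).eps  # about 1.19e-7

# A patch whose real data was masked out and replaced by zeros.
patch = torch.zeros(8)
scale = patch.std().clamp_min(eps)  # std is 0, so the clamp returns eps

# Any non-zero value scaled by eps becomes enormous ...
z = (torch.tensor(1.0) - patch.mean()) / scale
print(z.item())  # 8388608.0

# ... and downstream ops can then overflow and cancel into NaN.
overflow = torch.exp(z)     # inf
print(overflow - overflow)  # tensor(nan)
```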

@gorold
Contributor

gorold commented Apr 24, 2024

Hey @fmmoret, thanks for reporting this, we've also seen this occasionally. Another possible cause is the attention layer, if all tokens are masked.
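
The attention case is easy to reproduce in isolation: a softmax over a row in which every score was masked to -inf returns NaN. The sketch below uses toy scores and a boolean mask, and the guard at the end is one possible workaround under those assumptions, not the repository's fix.

```python
import torch

scores = torch.tensor([[0.5, 1.0, -0.3],
                       [0.2, 0.1, 0.7]])
# True = token may be attended to. The second query has no valid keys.
mask = torch.tensor([[True, True, False],
                     [False, False, False]])

masked_scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights[1])  # tensor([nan, nan, nan])

# Possible guard: give fully masked rows finite scores before the softmax,
# then zero their weights, so NaN never appears in forward or backward.
has_valid_key = mask.any(dim=-1, keepdim=True)
safe_scores = masked_scores.masked_fill(~has_valid_key, 0.0)
safe_weights = torch.softmax(safe_scores, dim=-1) * has_valid_key
print(safe_weights[1])  # tensor([0., 0., 0.])
```

Replacing the NaNs only after the softmax would clean up the forward output, but the softmax backward for those rows would still see NaN, which is why the guard acts on the scores first.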

Re-opening this issue to track this pre-training issue.

@gorold gorold reopened this Apr 24, 2024
@gorold gorold added the bug Something isn't working label Apr 24, 2024
@gorold gorold changed the title Training errors during finetuning - tensor errors Training errors - nans Apr 24, 2024
@chenghaoliu89
Contributor

> I have also gotten nan'd weights when doing pretraining on A100s and a 3090 ti when using the lotsa or gluonts datasets. Many of the datasets are very sparse / nan heavy 0 0 0 0 0 ... actual data.
>
> The distributions projections handle only matching patch sizes -- I think we're getting unlucky sometimes where there are a few good samples for patch size x, few good ones for patch size y, and then 1 sample for patch size y made of all 0s where basically all actual data got masked out.
>
> I haven't narrowed it down yet, but I think something like this might be happening. All 0s leads to 0 variance and some of the distribution code divides by variance. I think it gets clamped to dtype's epsilon, but that might result in huge magnitudes that turn into infs & nan's somewhere.
>
> They happen pretty rarely, so it's hard to repro reliably

For this case, a quick fix is to drop the outlier samples by adding the following line at line 107 of loader.py:

# keep only samples with at least one observed value in both the context rows and the prediction rows
batch = [sample for sample in batch
         if sample['observed_mask'][sample['prediction_mask'] == False, :sample['patch_size'][0]].any()
         and sample['observed_mask'][sample['prediction_mask'] == True, :sample['patch_size'][0]].any()]
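
For readers who want to try the filter outside the loader, here is a self-contained sketch of the same idea. It assumes each sample is a dict of tensors with observed_mask of shape (rows, max_patch), and prediction_mask and patch_size of shape (rows,). The shapes, the helper names and the toy samples are assumptions made for this demo.

```python
import torch


def has_observed_context_and_target(sample):
    """True if both context rows and prediction rows contain observed values."""
    patch_size = int(sample["patch_size"][0])
    observed = sample["observed_mask"][:, :patch_size]
    is_target = sample["prediction_mask"].bool()
    return bool(observed[~is_target].any()) and bool(observed[is_target].any())


def make_sample(observed_rows, prediction_rows, patch_size=2, max_patch=4):
    """Build a toy sample; each row is either fully observed or fully missing."""
    observed_mask = torch.zeros(len(observed_rows), max_patch, dtype=torch.bool)
    for i, is_observed in enumerate(observed_rows):
        observed_mask[i, :patch_size] = is_observed
    return {
        "observed_mask": observed_mask,
        "prediction_mask": torch.tensor(prediction_rows),
        "patch_size": torch.full((len(observed_rows),), patch_size),
    }


good = make_sample([True, True, True], [False, False, True])
# Every context row is unobserved, so there is nothing to condition on.
bad = make_sample([False, False, True], [False, False, True])

batch = [s for s in [good, bad] if has_observed_context_and_target(s)]
print(len(batch))  # 1
```

One thing to watch for with this kind of filter is a batch in which every sample is dropped; the collate step after it would need to cope with an empty list.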
