gencast_mini_demo.ipynb on AMD CPU #113

Open
dkokron opened this issue Dec 21, 2024 · 12 comments

Comments

@dkokron

dkokron commented Dec 21, 2024

I'm attempting to run the gencast_mini_demo.ipynb case on my home workstation without a usable GPU. The notebook recognizes that I don't have the correct software to run on the installed GPU and falls back to CPU (which is what I want to happen).

Output from cell 22.
WARNING:2024-12-21 14:22:21,184:jax._src.xla_bridge:969: An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.

I've attached the stack trace I get from cell 23 (Autoregressive rollout (loop in python)).
gencast.failure.txt

Is this expected? Does GenCast require a GPU or TPU to work?

@andrewlkd
Collaborator

Hey,

This looks like a splash attention related error. Splash attention is only supported on TPU.

You can try following the GPU instructions to change the attention mechanism; I believe this should work fine on CPU.

Note that without knowing the memory specifications of your device, I can't guarantee it won't run out of memory. We've also never run GenCast on CPU so cannot make any guarantees around its correctness.

Hope that helps!

Andrew

@dkokron
Author

dkokron commented Dec 21, 2024

I will try your suggestion and report back here.

@dkokron
Author

dkokron commented Dec 22, 2024

I followed the suggestion in the "Running Inference on GPU" section of cloud_vm_setup.md:

task_config = ckpt.task_config
sampler_config = ckpt.sampler_config
noise_config = ckpt.noise_config
noise_encoder_config = ckpt.noise_encoder_config
denoiser_architecture_config = ckpt.denoiser_architecture_config
denoiser_architecture_config.sparse_transformer_config.attention_type = "triblockdiag_mha"
denoiser_architecture_config.sparse_transformer_config.mask_type = "full"

The job (4 time steps and 8 members) ran for about 2h:30m using 17GB of system RAM with an average CPU load of ~30 (I have 48 cores). Unfortunately, the results are all NaN.

GenCast/graphcast/GenCast/lib/python3.12/site-packages/numpy/lib/_nanfunctions_impl.py:1409: RuntimeWarning: All-NaN slice encountered
return _nanquantile_unchecked(

@andrewlkd
Collaborator

I can't say I've seen this warning before. Could you confirm if the entire forecast was NaN? Note that we expect NaNs in the sea surface temperature variable so I wonder if this is what you might be encountering.
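
To tell an all-NaN forecast apart from the NaNs expected in sea surface temperature, it can help to report the NaN fraction per variable. The following is a minimal sketch, assuming the forecast can be viewed as a mapping from variable name to array; `nan_fraction_by_variable` and the `fake_forecast` data are hypothetical and only stand in for real GenCast output.

```python
import numpy as np


def nan_fraction_by_variable(arrays):
    """Return {name: fraction of NaN elements} for a mapping of name -> array."""
    return {
        name: float(np.isnan(np.asarray(values)).mean())
        for name, values in arrays.items()
    }


# Hypothetical stand-in data, not real model output.
fake_forecast = {
    "2m_temperature": np.full((2, 3), np.nan),  # every value is NaN
    "sea_surface_temperature": np.array(
        [[280.0, np.nan, 281.5], [np.nan, 279.0, 280.2]]
    ),  # partially NaN, as expected over land
    "geopotential": np.ones((2, 3)),  # no NaNs
}

fractions = nan_fraction_by_variable(fake_forecast)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.0%} NaN")
```

For an xarray Dataset (assumed here to be named `predictions`), the mapping could be built with `{name: predictions[name].values for name in predictions.data_vars}`. A variable at 100% NaN points to a failure in the forward pass, whereas a partial fraction in sea surface temperature alone would match the expected behaviour.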

@dkokron
Author

dkokron commented Dec 24, 2024

I was plotting 2m_temp for all 8 ensemble members. All members had this same warning. I'll need to run it again to view other variables.

@dkokron
Author

dkokron commented Dec 24, 2024

Specific humidity at 850 and 100, vertical speed at 850, geopotential at 500, and the u and v components of wind at 925 are also NaN. I did not look at the rest.

@dkokron
Author

dkokron commented Dec 29, 2024

Any more ideas on how to investigate this issue?

@andrewlkd
Collaborator

Unfortunately, we've never attempted to run the model on a CPU, as this is too slow for practical use. In principle there is no reason why the results should differ, but unexpected device-specific compilation issues may be manifesting here. In the meantime, hopefully the instructions on how to use free cloud compute are useful.

Do let us know if you gain any insights on why this is happening.

@dkokron
Author

dkokron commented Jan 7, 2025

If you've never attempted to run it on a decent CPU, then how do you know it won't be practical?
I'll see if I can figure out what is going wrong and report back here.

@guidov

guidov commented Jan 7, 2025

I also think it would be nice to be able to set up the model config and run it for one timestep on our own CPU systems and then move it to a cloud GPU or TPU. CPU systems have very large RAM nowadays.
I tried playing with this and also got NaNs.

I set this in the notebook, but if the CPU device count is greater than 1, I get an AssertionError.

import os
import jax
from jax import config

config.update("jax_platform_name", "cpu")

# Set the environment variable (this has to happen before JAX initializes its backends)
os.environ['XLA_FLAGS'] = '--xla_force_host_platform_device_count=24'
# Verify it's set
print(f"XLA_FLAGS: {os.getenv('XLA_FLAGS')}")

print(jax.devices())
jax.local_device_count(backend='cpu')

In the 'build jitted' section:

loss_fn_jitted = jax.jit(
    lambda rng, i, t, f: loss_fn.apply(params, state, rng, i, t, f)[0]
, backend='cpu')
grads_fn_jitted = jax.jit(grads_fn, backend='cpu')
run_forward_jitted = jax.jit(
    lambda rng, i, t, f: run_forward.apply(params, state, rng, i, t, f)[0]
, backend='cpu')
# We also produce a pmapped version for running in parallel.
run_forward_pmap = xarray_jax.pmap(run_forward_jitted, dim="sample", backend='cpu')

@andrewlkd Maybe #108 can be of some use; however, I clearly don't understand how JAX is working here with the CPUs.

When the CPU device count is set to 1, it uses all the CPUs anyway.

@dkokron
Author

dkokron commented Jan 12, 2025

Results from debugging so far are attached. I put a breakpoint in chunked_prediction_generator() in rollout.py, just before the call to predictor_fn(). I then printed some variables to look for NaNs and hit continue. The stack trace is in the attached text file. Please review and let me know if this helps shed any light on how the NaNs are being generated.

Debugging.txt

@andrewlkd
Collaborator

Hm, I'm not so sure this does shed light. This just suggests something in the actual predictor function (i.e. forward pass of GenCast) is causing NaNs when running on CPU.

In case it was something to do with the pmapping, I just tried running the non-pmapped case on my end and it still produces NaNs.

Let me know if you get any more data points from debugging.
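
One possible way to gather more data points is JAX's `jax_debug_nans` flag, which makes JAX raise an error as soon as a computation produces a NaN instead of letting it propagate silently. The sketch below is a toy and not GenCast code: `toy_forward` is a hypothetical stand-in for the real forward pass, and whether the flag pinpoints the failing op in GenCast on CPU has not been tested here.

```python
import jax
import jax.numpy as jnp

# Ask JAX to raise as soon as a computation produces a NaN.
jax.config.update("jax_debug_nans", True)


@jax.jit
def toy_forward(x):
    # Hypothetical stand-in for a forward pass: log of a negative input is NaN.
    return jnp.log(x)


try:
    toy_forward(jnp.array([1.0, -1.0]))
    nan_detected = False
except FloatingPointError as err:
    nan_detected = True
    print("NaN detected:", err)
```

The flag adds overhead, so it is probably best enabled only while debugging, and it seems safest to apply it to the non-pmapped path, which reportedly produces NaNs as well.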
