ImportError - fast_transformers/causal_product undefined symbol - unable to train or finetune #6
Update: I tried starting from scratch on a different machine that's running nvcc 11.0 / CUDA 11.0. So the only changes I had to make to the installation instructions were:
But this time, the Apex compilation fails with the following error:
Here are the details for this system:
Thank you in advance for any advice you can provide!
Not sure if this will solve your issue, but I found that apex compilation worked when I changed to the 22.04-dev branch.
Thanks for sharing, but unfortunately this didn't solve my problem. On the first machine I mentioned above, I was able to compile apex but then got an error while running the molformer code. On the second machine, I wasn't able to compile apex.
On the first machine referenced above, I used the following installation this time (replacing conda with mamba for speed):
Everything installed without errors, but the error when running the finetune script persists:
It doesn't seem like the authors are replying to anyone so I'll try and help. Here is the env that works for me:
It looks like the torch version you used is different from mine. Hopefully this helps! Edit: I used CUDA 11.0 in the conda env, and installed it system-wide through the NVIDIA repos.
On the second machine, I ran:
I've upgraded CUDA on the second machine since my last post, so now
My new error is:
@philspence thanks for sharing your installation process. It doesn't look like the different version of pytorch fixes it for me. Can you confirm that, after compiling apex successfully, you are also able to run the h298 finetuning benchmark?
I hadn't run that particular benchmark before, but the other benchmarks completed without issue. I've started the h298 benchmark and it seems to be running okay (only one epoch finished so far). Have you tried to replicate my env? I notice that my compile command uses an extra flag.
Yeah, I tried to replicate yours and still wasn't able to get it working. It didn't work regardless of whether or not I used that flag.
Did you also change your system CUDA version to 11.0.3? Here is my environment:
And my system CUDA install:
Unfortunately I can't change my system CUDA version because this machine is a group workstation, and changing the system version would be disruptive for the other users. But when I first encountered this problem, my system was on CUDA 11.0, so I don't think that's the problem.
Ah, in which case, install cudatoolkit-dev:
and set your CUDA_HOME to the base of your conda env. That worked for me on a different system. If that doesn't work for you, then I'm out of ideas!
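For concreteness, a hedged sketch of those two steps (the env name `molformer` and the `~/anaconda3` path are assumptions; substitute your own):

```shell
# Install a full CUDA toolkit (including nvcc) inside the conda env,
# so apex can compile against it without touching the system CUDA.
conda install -y -c conda-forge cudatoolkit-dev

# Point CUDA_HOME at the env root; nvcc then resolves to $CUDA_HOME/bin/nvcc.
export CUDA_HOME="$HOME/anaconda3/envs/molformer"
export PATH="$CUDA_HOME/bin:$PATH"
```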
I see that many people are getting frustrated because of CUDA version issues with Apex. My suggestion: if you only want to fine-tune the model, you can avoid using Apex altogether. I replaced the optimizer in the Python script located in the "finetune" folder with one from the standard torch.optim module, and was able to successfully execute the fine-tuning script:

```python
...
# from apex import optimizers
from torch.optim import Adam
...
def configure_optimizers(self):
    ...
    # optimizer = optimizers.FusedLAMB(optim_groups, lr=learning_rate, betas=betas)
    optimizer = Adam(optim_groups, lr=learning_rate, betas=betas)
    ...
```

I only installed these modules in my conda env:

And I'm using an Ubuntu 22.04 Docker image with an RTX 3090 GPU. It works for me.
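As a minimal, self-contained illustration of the swap above (the tiny linear model, learning rate, and betas here are placeholders, not values from the molformer scripts):

```python
# Sketch: torch.optim.Adam as a drop-in replacement for apex's FusedLAMB.
# optim_groups / learning_rate / betas mirror the names in the snippet above.
import torch
from torch import nn
from torch.optim import Adam

model = nn.Linear(8, 2)                      # placeholder model
learning_rate, betas = 1e-4, (0.9, 0.999)    # placeholder hyperparameters
optim_groups = [
    {"params": model.parameters(), "weight_decay": 0.0},
]
optimizer = Adam(optim_groups, lr=learning_rate, betas=betas)

# One dummy optimization step to confirm no apex dependency is needed.
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()
```

Note that Adam is not numerically identical to FusedLAMB (LAMB adds layer-wise learning-rate adaptation), so fine-tuning results may differ slightly.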
I'm able to extract the embeddings from frozen_embeddings_classification.ipynb. For people who are facing the same problem, I created the env with:

```shell
conda create -n molformer -y
python -m pip install rdkit datasets regex
```

Then adopt the `configure_optimizers(self)` change from @BlenderWang9487's response. I hope this also helps.
This solution to the “fast_transformers/causal_product undefined symbol” error worked for me. I used the install of BlenderWang9487 with the following additional steps:
4. Included the following statements in the script:

```shell
export CUDA_HOME=/home/user/anaconda3/envs/molformer
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/user/anaconda3/envs/molformer/lib
```

This should fix the “fast_transformers/causal_product undefined symbol” error. There’s probably a more elegant way to do this, but this is what worked for me.

Edit: jackl-o-o-l had a better solution.
After downloading the data, I go to run

```shell
bash run_finetune_h298.sh
```

and get the ImportError above (fast_transformers/causal_product undefined symbol). I get a similar error when running

```shell
bash run_pubchem_light.sh
```

I set up my environment based on the instructions in environment.md. The differences between my setup and the original instructions were:

- Adding `-c conda-forge` to the 2nd `conda install` command (it couldn't find the packages otherwise).
- `export CUDA_HOME='/usr'` (the actual location on my system, found using `which nvcc`, which gave the output `/usr/bin/nvcc`).
- Changing the `pytorch` and `cudatoolkit` versions to match the `nvcc` version I have installed, which is 11.6 (compiling Apex failed otherwise). I used the oldest `pytorch` version that supported `cudatoolkit=11.6` (based on instructions here) to maximize the likelihood of compatibility, since this repo was created using `pytorch==1.7.1 cudatoolkit=11.0`.

Additional information that may be useful, from `nvidia-smi`:

```
NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7
```

Based on similar errors people have gotten with other repos (e.g. here, here), it seems that the problem is related to my version of PyTorch, but I'm not sure how to resolve this while still allowing Apex to compile on my system. Is it possible to run this repo on a system using nvcc 11.6 / CUDA 11.7?
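One way to narrow this down is to compare the CUDA version PyTorch was built against (`torch.version.cuda`) with the nvcc toolkit version (`nvcc --version`). The helper below is a hedged sketch, not part of the repo, and the version strings in it are illustrative:

```python
# Undefined-symbol errors when importing compiled extensions such as
# fast_transformers.causal_product are commonly caused by a mismatch between
# the CUDA toolkit used to compile the extension and the CUDA version torch
# was built against. This helper compares major.minor versions.

def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string like '11.6' or '11.6.124'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def toolkits_match(torch_cuda: str, nvcc_cuda: str) -> bool:
    """True if torch's CUDA build and the nvcc toolkit share major.minor."""
    return major_minor(torch_cuda) == major_minor(nvcc_cuda)

print(toolkits_match("11.0", "11.6"))      # False: undefined-symbol risk
print(toolkits_match("11.6", "11.6.124"))  # True
```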