
Multi-gpu support #5

Open
R00Kie-Liu opened this issue May 8, 2019 · 10 comments

Comments

@R00Kie-Liu

How can I use multiple GPUs for the search?

@198808xc
Collaborator

198808xc commented May 8, 2019

Thanks for this question!

I think multi-GPU works just like single-GPU. Since our search on CIFAR takes only a few hours, we did not consider multi-GPU training. However, in our recent work that generalizes P-DARTS to search directly on ImageNet, we did use 8 GPUs for acceleration.

@chenxin061 more experiences to share?

@chenxin061
Owner

chenxin061 commented May 8, 2019

To search with multiple GPUs, you need to change a few lines in train_search.py.

  1. Delete all lines related to GPU ID setting. Instead, you can set GPU ids with CUDA_VISIBLE_DEVICES.
  2. Add model = nn.DataParallel(model) before model = model.cuda() and model = model.module after it.
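A minimal sketch of those two steps, assuming a DARTS-style search network that exposes an arch_parameters() method for the architecture optimizer (the Network class and its members below are illustrative stand-ins, not the exact code in train_search.py):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the search network in train_search.py; the real
# class exposes arch_parameters() so the architecture optimizer can find the
# architecture weights (alphas).
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Linear(8, 4)
        self.alphas = nn.Parameter(1e-3 * torch.randn(4))

    def arch_parameters(self):
        return [self.alphas]

    def forward(self, x):
        return self.stem(x) * torch.softmax(self.alphas, dim=-1)

model = Network()
model = nn.DataParallel(model)   # step 2: wrap before .cuda()
if torch.cuda.is_available():
    model = model.cuda()
model = model.module             # step 2: unwrap afterwards, so custom
                                 # methods like arch_parameters() still resolve
optimizer_a = torch.optim.Adam(model.arch_parameters(), lr=3e-4)
```

Note that after unwrapping with model.module, the forward pass runs on the plain module again; the point of this recipe is mainly to keep methods such as arch_parameters() directly accessible.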

@zihaozhang9


To search with multiple GPUs, you need to change a few lines in train_search.py.

  1. Delete all lines related to GPU ID setting. Instead, you can set GPU ids with CUDA_VISIBLE_DEVICES.
  2. Add model = nn.DataParallel(model) before model = model.cuda() and model = model.module after it.

I added model = nn.DataParallel(model) to train_search.py and got this error:

Traceback (most recent call last):
  File "train_search.py", line 469, in <module>
    main()
  File "train_search.py", line 142, in main
    optimizer_a = torch.optim.Adam(model.arch_parameters(),
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 518, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'arch_parameters'

@anhcda-study

@zihaozhang9 To fix that, change model.arch_parameters() to model.module.arch_parameters().
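The cause and the fix can be reproduced with a small toy (the Net class here is an illustrative stand-in for the search network): nn.DataParallel forwards tensors and submodules but not custom methods, so those have to be reached through the .module attribute.

```python
import torch
import torch.nn as nn

# Illustrative model with a custom method, mirroring arch_parameters()
# in train_search.py (names here are stand-ins).
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.alphas = nn.Parameter(torch.zeros(3))

    def arch_parameters(self):
        return [self.alphas]

    def forward(self, x):
        return x

wrapped = nn.DataParallel(Net())

# DataParallel does not forward custom methods of the wrapped module,
# which is exactly the AttributeError reported above:
try:
    wrapped.arch_parameters()
except AttributeError as e:
    print(e)

# The fix: reach through the wrapper via .module
params = wrapped.module.arch_parameters()
```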

@JarveeLee

JarveeLee commented Sep 3, 2019

I did several things.
I commented out the device setting:

#torch.cuda.set_device(args.gpu)

and then added

model = nn.DataParallel(model)
model = model.cuda()
model = model.module

and then set

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3,4,5,6,7'
parser.add_argument('--batch_size', type=int, default=192, help='batch size')

but I still cannot run train_search.py on multiple GPUs; it still tries to cram everything onto a single GPU and runs out of memory. What is wrong here?

I am using PyTorch 1.0.0 with Python 3.6, and print(torch.cuda.device_count()) returns 4.

If I instead use

model = nn.DataParallel(model)
model = model.cuda()
#model = model.module

together with

model.module.arch_parameters()

I get this error:
[screenshot of the error message]

@chenxin061
Owner

The new version of our code now supports multi-GPU search!
@JarveeLee You can try it.
Use CUDA_VISIBLE_DEVICES to assign GPU ids.
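One detail worth noting when assigning GPU IDs this way: CUDA_VISIBLE_DEVICES is read when CUDA is first initialized, so setting it inside a script only works if it happens before the first CUDA call. A sketch (the IDs below are just examples):

```python
import os

# Set the variable at the very top of the script, before importing torch,
# so it takes effect before CUDA is initialized (ids are examples).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch  # imported after setting the variable, on purpose

# Only the listed devices are now visible to PyTorch
# (device_count() is 0 on a CPU-only machine).
print(torch.cuda.device_count())
```

Equivalently, set it from the shell when launching: CUDA_VISIBLE_DEVICES=0,1,2,3 python train_search.py.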

@JarveeLee

I saw your modification; I did the same to support multiple GPUs. What is more, in

class MixedOp(nn.Module):
    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.m_ops))

the forward should change to

class MixedOp(nn.Module):
    def forward(self, x, weights):
        return sum(w.cuda() * op(x.cuda()) for w, op in zip(weights, self.m_ops))

otherwise the error I encountered will still happen.
I am working on a complicated GPU server where the environment is hard to control; that is my experience.
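The MixedOp weighted sum discussed above can be sketched as a runnable toy. The candidate operations below are illustrative stand-ins for the repository's actual operation set, and this CPU sketch omits the .cuda() calls, which only apply on a GPU machine:

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Weighted sum of candidate operations, as in the forward() above."""
    def __init__(self, channels):
        super().__init__()
        # Stand-ins for the candidate ops searched over in DARTS/P-DARTS;
        # all preserve the spatial size so the sum is well defined.
        self.m_ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.AvgPool2d(3, stride=1, padding=1),
        ])

    def forward(self, x, weights):
        # On multi-GPU the comment above suggests w.cuda() * op(x.cuda());
        # on CPU the weighted sum is the same computation.
        return sum(w * op(x) for w, op in zip(weights, self.m_ops))

op = MixedOp(channels=8)
x = torch.randn(2, 8, 16, 16)
weights = torch.softmax(torch.zeros(3), dim=-1)  # one weight per candidate op
y = op(x, weights)
print(y.shape)  # torch.Size([2, 8, 16, 16])
```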

@davidrpugh

@chenxin061 Thanks for sharing your code! Can you confirm whether you used 8 V100 GPUs with 16 GB of memory per card or 8 V100 GPUs with 32 GB memory per card? Thanks!

@chenxin061
Owner

@davidrpugh The search code is tested on two P100 GPUs, and the evaluation code is tested on 8 V100 GPUs with 16 GB memory each.

@davidrpugh

@chenxin061 Thanks! I suspected as much for the V100s; I didn't realize that you used 2 P100s. I was able to complete the search on CIFAR-10 or CIFAR-100 using a single P100 with 16 GB in 7-8 hours (as advertised in the paper and README).
