forked from NVIDIA/Megatron-LM
Issues: deepspeedai/Megatron-DeepSpeed
#121 · Issues with DeepSpeed optimizer and tensor parallelism when changing topology between machines · opened Mar 31, 2023 by liutaocode
#118 · only tuning the layernorm or added adapter params error · opened Mar 14, 2023 by MultiModalPromptTuning
#115 · Encountered error when enabling ZeRO and CPU Activation Checkpointing at the same time · opened Mar 5, 2023 by zincnode
#111 · Are there any other layer norm functions, such as RMSNorm or DeepNorm · opened Feb 13, 2023 by lvcc2018
#107 · Website documentation is incoherent with the repository content · opened Jan 19, 2023 by AnthoJack
#97 · AttributeError: module 'transformer_inference' has no attribute 'layer_norm_fp16' · opened Nov 28, 2022 by ranggihwang
#93 · The process is stuck at this step: compiling and loading fused kernels ... · opened Nov 10, 2022 by AQA6666
#91 · deepspeed to megatron - mismatch in function definition and call · opened Oct 14, 2022 by MatejUlcar
#81 · megatron-deepspeed layernorm has different output compare with megatron-lm? · opened Aug 22, 2022 by Kite0011