Issues: pytorch/torchtitan
WARNING - When using FSDP, it's recommended to enable config.force_recompute_fp8_weight_in_bwd.
#821
opened Feb 5, 2025 by
c0g
Is user-defined initializers a must-have for FSDP2?
question
Further information is requested
#818
opened Feb 4, 2025 by
goldhuang
HSDP causes loss instability
question
Further information is requested
#813
opened Jan 31, 2025 by
apkumar
FSDP checkpoints don't load when run is restarted with greater world size
bug
Something isn't working
documentation
Improvements or additions to documentation
enhancement
New feature or request
debug model training hangs on NVIDIA B200 with >1 GPU
bug
Something isn't working
#810
opened Jan 28, 2025 by
vkuzo
Loss metrics dramatically change after resuming from checkpoint
bug
Something isn't working
enhancement
New feature or request
Gradient Scaling With Pipeline Parallelism
question
Further information is requested
#803
opened Jan 24, 2025 by
windsornguyen
should we have an extension point for model transforms out of tree?
enhancement
New feature or request
#790
opened Jan 15, 2025 by
vkuzo
[Bug] Unexpected performance drop with float8 training + compiling only nn.Linear layers + using selective per op AC
bug
Something isn't working
#786
opened Jan 10, 2025 by
danielvegamyhre
Why use RowwiseParallel for nn.Embedding instead of ColwiseParallel?
question
Further information is requested
#785
opened Jan 10, 2025 by
corey-lambda
BUG: early_step_in_backward with pipeline parallelism and len(model_parts) > 1
bug
Something isn't working
#777
opened Jan 7, 2025 by
cassanof
PP hangs when pipeline_parallel_microbatches < pipeline_parallel_degree
bug
Something isn't working
#775
opened Jan 6, 2025 by
cassanof
PP InterleavedZeroBubble schedule shows low TPS and high memory usage
bug
Something isn't working
release_blocking
Issues that are blocking the milestone / release completion
FSDP 2 doesn't pad tensors?
question
Further information is requested
#764
opened Dec 29, 2024 by
cassanof
Checkpoint conversion
question
Further information is requested
#758
opened Dec 20, 2024 by
MaxiBoether
[question]can't disable CP for specific (unsupported) SDPA op
enhancement
New feature or request
module: context parallel
#757
opened Dec 20, 2024 by
FindDefinition
Any plans to support DPO training?
enhancement
New feature or request
#756
opened Dec 20, 2024 by
xs1997zju
JobConfig does not support typing
enhancement
New feature or request
#753
opened Dec 18, 2024 by
greeneggsandyaml
Model init with HuggingFace model
bug
Something isn't working
question
Further information is requested
#743
opened Dec 16, 2024 by
neeldani
Low bit Optimizers & FA-3
bug
Something isn't working
question
Further information is requested
#742
opened Dec 16, 2024 by
asahni04
using fsdp2 wrapper Flux(text to image) model , gradient is inconsistent with fsdp1
question
Further information is requested
#734
opened Dec 13, 2024 by
yanmj0601