unable to run 8b llama #807

Open
asahni04 opened this issue Jan 27, 2025 · 1 comment
Labels
bug Something isn't working

Comments


asahni04 commented Jan 27, 2025

/torchtitan# CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
+ NGPU=8
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/llama3_8b.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/llama3_8b.toml
/opt/conda/lib/python3.11/site-packages/torch/utils/_pytree.py:185: FutureWarning: optree is installed but the version is too old to support PyTorch Dynamo in C++ pytree. C++ pytree support is disabled. Please consider upgrading optree using `python3 -m pip install --upgrade 'optree>=0.13.0'`.
  warnings.warn(
W0127 19:58:42.094000 381097 site-packages/torch/distributed/run.py:792] 
W0127 19:58:42.094000 381097 site-packages/torch/distributed/run.py:792] *****************************************
W0127 19:58:42.094000 381097 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0127 19:58:42.094000 381097 site-packages/torch/distributed/run.py:792] *****************************************
[rank0]:/opt/conda/lib/python3.11/site-packages/torch/utils/_pytree.py:185: FutureWarning: optree is installed but the version is too old to support PyTorch Dynamo in C++ pytree. C++ pytree support is disabled. Please consider upgrading optree using `python3 -m pip install --upgrade 'optree>=0.13.0'`.
[rank0]:  warnings.warn(
[rank0]:2025-01-27 19:58:48,889 - root - INFO - Starting job: Llama 3 8B training
[rank0]:2025-01-27 19:58:49,213 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2025-01-27 19:58:49,218 - root - INFO - CUDA capacity: NVIDIA A100-SXM4-80GB with 79.14GiB memory
[rank0]:2025-01-27 19:58:49,221 - root - WARNING - Error running lspci: [Errno 2] No such file or directory: 'lspci', fallback to use device_name
[rank0]:2025-01-27 19:58:49,221 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
[rank0]:2025-01-27 19:58:49,221 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
[rank0]:2025-01-27 19:58:53,172 - root - INFO - Building tiktoken tokenizer locally from ./torchtitan/datasets/tokenizer/original/tokenizer.model
[rank0]:2025-01-27 19:58:53,405 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2025-01-27 19:58:53,405 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2025-01-27 19:58:59,971 - root - INFO - Building llama3 8B with ModelArgs(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, norm_type='rmsnorm')
[rank0]:2025-01-27 19:59:00,228 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:2025-01-27 19:59:00,229 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]:   File "torchtitan/train.py", line 434, in <module>
[rank0]:[rank0]:     main(config)
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[rank0]:[rank0]:     return f(*args, **kwargs)
[rank0]:[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "torchtitan/train.py", line 170, in main
[rank0]:[rank0]:     models_parallelize_fns[model_name](model, world_mesh, parallel_dims, job_config)
[rank0]:[rank0]:   File "/torchtitan/parallelisms/parallelize_llama.py", line 87, in parallelize_llama
[rank0]:[rank0]:     apply_fsdp(
[rank0]:[rank0]:   File "torchtitan/torchtitan/parallelisms/parallelize_llama.py", line 334, in apply_fsdp
[rank0]:[rank0]:     fully_shard(
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/_composable/contract.py", line 190, in wrapper
[rank0]:[rank0]:     updated = func(inp_module, *args, **kwargs)
[rank0]:[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fully_shard.py", line 176, in fully_shard
[rank0]:[rank0]:     state._fsdp_param_group = FSDPParamGroup(
[rank0]:[rank0]:                               ^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 131, in __init__
[rank0]:[rank0]:     self.fsdp_params = [
[rank0]:[rank0]:                        ^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 132, in <listcomp>
[rank0]:[rank0]:     FSDPParam(
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param.py", line 240, in __init__
[rank0]:[rank0]:     self._init_sharded_param(param, device, shard_placement_fn)
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:[rank0]:     return func(*args, **kwargs)
[rank0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param.py", line 395, in _init_sharded_param
[rank0]:[rank0]:     self.sharded_param = nn.Parameter(self.to_sharded_dtensor(sharded_param))
[rank0]:[rank0]:                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param.py", line 615, in to_sharded_dtensor
[rank0]:[rank0]:     return _from_local_no_grad(
[rank0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_common.py", line 151, in _from_local_no_grad
[rank0]:[rank0]:     return DTensor(
[rank0]:[rank0]:            ^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_compile.py", line 46, in inner
[rank0]:[rank0]:     import torch._dynamo
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/__init__.py", line 42, in <module>
[rank0]:[rank0]:     from .polyfills import loader as _  # usort: skip # noqa: F401
[rank0]:[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/polyfills/loader.py", line 24, in <module>
[rank0]:[rank0]:     POLYFILLED_MODULES: Tuple["ModuleType", ...] = tuple(
[rank0]:[rank0]:                                                    ^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/polyfills/loader.py", line 25, in <genexpr>
[rank0]:[rank0]:     importlib.import_module(f".{submodule}", package=polyfills.__name__)
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/importlib/__init__.py", line 126, in import_module
[rank0]:[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/polyfills/builtins.py", line 26, in <module>
[rank0]:[rank0]:     @substitute_in_graph(builtins.all, can_constant_fold_through=True)
[rank0]:[rank0]:      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/decorators.py", line 369, in wrapper
[rank0]:[rank0]:     rule_map: Dict[Any, Type[VariableTracker]] = get_torch_obj_rule_map()
[rank0]:[rank0]:                                                  ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/trace_rules.py", line 2882, in get_torch_obj_rule_map
[rank0]:[rank0]:     obj = load_object(k)
[rank0]:[rank0]:           ^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/trace_rules.py", line 2913, in load_object
[rank0]:[rank0]:     val = _load_obj_from_str(x[0])
[rank0]:[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/trace_rules.py", line 2897, in _load_obj_from_str
[rank0]:[rank0]:     return getattr(importlib.import_module(module), obj_name)
[rank0]:[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/importlib/__init__.py", line 126, in import_module
[rank0]:[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_higher_order_ops/map.py", line 6, in <module>
[rank0]:[rank0]:     from torch._functorch.aot_autograd import AOTConfig, create_joint
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 147, in <module>
[rank0]:[rank0]:     from .partitioners import default_partition
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/partitioners.py", line 31, in <module>
[rank0]:[rank0]:     from ._activation_checkpointing.graph_info_provider import GraphInfoProvider
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_activation_checkpointing/graph_info_provider.py", line 3, in <module>
[rank0]:[rank0]:     import networkx as nx
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/networkx/__init__.py", line 23, in <module>
[rank0]:[rank0]:     config = utils.backends._set_configs_from_environment()
[rank0]:[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/networkx/utils/backends.py", line 574, in _set_configs_from_environment
[rank0]:[rank0]:     backends=Config(
[rank0]:[rank0]:              ^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/site-packages/networkx/utils/configs.py", line 84, in __new__
[rank0]:[rank0]:     cls = dataclass(
[rank0]:[rank0]:           ^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/dataclasses.py", line 1222, in wrap
[rank0]:[rank0]:     return _process_class(cls, init, repr, eq, order, unsafe_hash,
[rank0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/dataclasses.py", line 1027, in _process_class
[rank0]:[rank0]:     _init_fn(all_init_fields,
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/dataclasses.py", line 580, in _init_fn
[rank0]:[rank0]:     return _create_fn('__init__',
[rank0]:[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/opt/conda/lib/python3.11/dataclasses.py", line 433, in _create_fn
[rank0]:[rank0]:     exec(txt, globals, ns)
[rank0]:[rank0]:   File "<string>", line 1
[rank0]:[rank0]:     def __create_fn__(_type_nx-loopback, MISSING, _HAS_DEFAULT_FACTORY, __dataclass_builtins_object__, _return_type):
[rank0]:[rank0]:                               ^
[rank0]:[rank0]: SyntaxError: invalid syntax
[rank0]:[rank0]:[W127 19:59:00.196956265 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

No changes were made to the config.

weifengpy (Contributor) commented

The error comes from this line: `import torch._dynamo`.

Could you run the following script outside the context of torchtitan, so we can tell whether it's a PyTorch requirements problem?

import torch
import torch._dynamo
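
Since the traceback above bottoms out inside networkx's backend Config dataclass rather than in torch itself, a second check may also be worth running (a minimal sketch, on the assumption that the installed environment, not torchtitan, is at fault):

# If the networkx install is the culprit, importing it on its own
# should reproduce the same SyntaxError seen in the traceback.
import networkx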

tianyu-l added the bug (Something isn't working) label Jan 30, 2025