/torchtitan# CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
+ NGPU=8
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/llama3_8b.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/llama3_8b.toml
/opt/conda/lib/python3.11/site-packages/torch/utils/_pytree.py:185: FutureWarning: optree is installed but the version is too old to support PyTorch Dynamo in C++ pytree. C++ pytree support is disabled. Please consider upgrading optree using `python3 -m pip install --upgrade 'optree>=0.13.0'`.
warnings.warn(
W0127 19:58:42.094000 381097 site-packages/torch/distributed/run.py:792]
W0127 19:58:42.094000 381097 site-packages/torch/distributed/run.py:792] *****************************************
W0127 19:58:42.094000 381097 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0127 19:58:42.094000 381097 site-packages/torch/distributed/run.py:792] *****************************************
[rank0]:/opt/conda/lib/python3.11/site-packages/torch/utils/_pytree.py:185: FutureWarning: optree is installed but the version is too old to support PyTorch Dynamo in C++ pytree. C++ pytree support is disabled. Please consider upgrading optree using `python3 -m pip install --upgrade 'optree>=0.13.0'`.
[rank0]: warnings.warn(
[rank0]:2025-01-27 19:58:48,889 - root - INFO - Starting job: Llama 3 8B training
[rank0]:2025-01-27 19:58:49,213 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2025-01-27 19:58:49,218 - root - INFO - CUDA capacity: NVIDIA A100-SXM4-80GB with 79.14GiB memory
[rank0]:2025-01-27 19:58:49,221 - root - WARNING - Error running lspci: [Errno 2] No such file or directory: 'lspci', fallback to use device_name
[rank0]:2025-01-27 19:58:49,221 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
[rank0]:2025-01-27 19:58:49,221 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
[rank0]:2025-01-27 19:58:53,172 - root - INFO - Building tiktoken tokenizer locally from ./torchtitan/datasets/tokenizer/original/tokenizer.model
[rank0]:2025-01-27 19:58:53,405 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2025-01-27 19:58:53,405 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2025-01-27 19:58:59,971 - root - INFO - Building llama3 8B with ModelArgs(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, norm_type='rmsnorm')
[rank0]:2025-01-27 19:59:00,228 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:2025-01-27 19:59:00,229 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]: File "torchtitan/train.py", line 434, in <module>
[rank0]:[rank0]: main(config)
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[rank0]:[rank0]: return f(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "torchtitan/train.py", line 170, in main
[rank0]:[rank0]: models_parallelize_fns[model_name](model, world_mesh, parallel_dims, job_config)
[rank0]:[rank0]: File "/torchtitan/parallelisms/parallelize_llama.py", line 87, in parallelize_llama
[rank0]:[rank0]: apply_fsdp(
[rank0]:[rank0]: File "torchtitan/torchtitan/parallelisms/parallelize_llama.py", line 334, in apply_fsdp
[rank0]:[rank0]: fully_shard(
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/_composable/contract.py", line 190, in wrapper
[rank0]:[rank0]: updated = func(inp_module, *args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fully_shard.py", line 176, in fully_shard
[rank0]:[rank0]: state._fsdp_param_group = FSDPParamGroup(
[rank0]:[rank0]: ^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 131, in __init__
[rank0]:[rank0]: self.fsdp_params = [
[rank0]:[rank0]: ^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 132, in <listcomp>
[rank0]:[rank0]: FSDPParam(
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param.py", line 240, in __init__
[rank0]:[rank0]: self._init_sharded_param(param, device, shard_placement_fn)
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:[rank0]: return func(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param.py", line 395, in _init_sharded_param
[rank0]:[rank0]: self.sharded_param = nn.Parameter(self.to_sharded_dtensor(sharded_param))
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param.py", line 615, in to_sharded_dtensor
[rank0]:[rank0]: return _from_local_no_grad(
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_common.py", line 151, in _from_local_no_grad
[rank0]:[rank0]: return DTensor(
[rank0]:[rank0]: ^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_compile.py", line 46, in inner
[rank0]:[rank0]: import torch._dynamo
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/__init__.py", line 42, in <module>
[rank0]:[rank0]: from .polyfills import loader as _ # usort: skip # noqa: F401
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/polyfills/loader.py", line 24, in <module>
[rank0]:[rank0]: POLYFILLED_MODULES: Tuple["ModuleType", ...] = tuple(
[rank0]:[rank0]: ^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/polyfills/loader.py", line 25, in <genexpr>
[rank0]:[rank0]: importlib.import_module(f".{submodule}", package=polyfills.__name__)
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/importlib/__init__.py", line 126, in import_module
[rank0]:[rank0]: return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/polyfills/builtins.py", line 26, in <module>
[rank0]:[rank0]: @substitute_in_graph(builtins.all, can_constant_fold_through=True)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/decorators.py", line 369, in wrapper
[rank0]:[rank0]: rule_map: Dict[Any, Type[VariableTracker]] = get_torch_obj_rule_map()
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/trace_rules.py", line 2882, in get_torch_obj_rule_map
[rank0]:[rank0]: obj = load_object(k)
[rank0]:[rank0]: ^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/trace_rules.py", line 2913, in load_object
[rank0]:[rank0]: val = _load_obj_from_str(x[0])
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/trace_rules.py", line 2897, in _load_obj_from_str
[rank0]:[rank0]: return getattr(importlib.import_module(module), obj_name)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/importlib/__init__.py", line 126, in import_module
[rank0]:[rank0]: return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_higher_order_ops/map.py", line 6, in <module>
[rank0]:[rank0]: from torch._functorch.aot_autograd import AOTConfig, create_joint
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 147, in <module>
[rank0]:[rank0]: from .partitioners import default_partition
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/partitioners.py", line 31, in <module>
[rank0]:[rank0]: from ._activation_checkpointing.graph_info_provider import GraphInfoProvider
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_activation_checkpointing/graph_info_provider.py", line 3, in <module>
[rank0]:[rank0]: import networkx as nx
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/networkx/__init__.py", line 23, in <module>
[rank0]:[rank0]: config = utils.backends._set_configs_from_environment()
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/networkx/utils/backends.py", line 574, in _set_configs_from_environment
[rank0]:[rank0]: backends=Config(
[rank0]:[rank0]: ^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/site-packages/networkx/utils/configs.py", line 84, in __new__
[rank0]:[rank0]: cls = dataclass(
[rank0]:[rank0]: ^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/dataclasses.py", line 1222, in wrap
[rank0]:[rank0]: return _process_class(cls, init, repr, eq, order, unsafe_hash,
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/dataclasses.py", line 1027, in _process_class
[rank0]:[rank0]: _init_fn(all_init_fields,
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/dataclasses.py", line 580, in _init_fn
[rank0]:[rank0]: return _create_fn('__init__',
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/opt/conda/lib/python3.11/dataclasses.py", line 433, in _create_fn
[rank0]:[rank0]: exec(txt, globals, ns)
[rank0]:[rank0]: File "<string>", line 1
[rank0]:[rank0]: def __create_fn__(_type_nx-loopback, MISSING, _HAS_DEFAULT_FACTORY, __dataclass_builtins_object__, _return_type):
[rank0]:[rank0]: ^
[rank0]:[rank0]: SyntaxError: invalid syntax
[rank0]:[rank0]:[W127 19:59:00.196956265 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
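The final SyntaxError comes from `dataclasses` exec'ing the generated `__init__` source for networkx's backends `Config`: the source embeds the backend name `nx-loopback` (see the `def __create_fn__(_type_nx-loopback, ...)` line above), and a name containing a hyphen is not a valid Python identifier. A minimal sketch that reproduces the same failure mode outside torchtitan/networkx, assuming the hyphenated backend name is indeed the root cause suggested by the traceback:

```python
# Minimal repro sketch (assumption: same failure mode as the traceback).
# A dataclass field named "nx-loopback" makes dataclasses generate
# __init__ source containing "_type_nx-loopback", which exec() rejects
# with "SyntaxError: invalid syntax".
from dataclasses import dataclass

Backends = type("Backends", (), {"__annotations__": {"nx-loopback": bool}})
dataclass(Backends)  # raises SyntaxError, as in the log above
```

If that is what is happening here, the failure is in how the networkx backends config is built from installed backend entry points rather than in torchtitan itself; upgrading or reinstalling networkx so the `nx-loopback` testing backend is no longer turned into a config field may avoid it, but that is an assumption based on this traceback, not something the log confirms.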
No changes were made to the config; this is the stock llama3_8b.toml run as shown above.