Releases · huggingface/accelerate

07 Jun 15:27

muellerzr

v0.31.0

66eefd7

v0.31.0: Better support for sharded state dict with FSDP and Bugfixes

Core

Set timeout default to PyTorch defaults based on backend by @muellerzr in #2758
fix duplicate elements in split_between_processes by @hkunzhe in #2781
Add Elastic Launch Support to notebook_launcher by @yhna940 in #2788
Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790

FSDP

Introduce shard-merging util for FSDP by @muellerzr in #2772
Enable sharded state dict + offload to cpu resume by @muellerzr in #2762
Enable config for fsdp activation checkpointing by @helloworld1 in #2779

Megatron

Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501

What's Changed

Add feature to allow redirecting std streams into log files when using torchrun as the launcher. by @lyuwen in #2740
Update modeling.py by adding try-catch section to skip the unavailable devices by @MeVeryHandsome in #2681
Fixed the problem of incorrect conditional judgment statement when configuring enable_cpu_affinity by @statelesshz in #2748
Fix stacklevel in logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions by @luowyang in #2730
LOMO / FIX: Support multiple optimizers by @younesbelkada in #2745
Fix max_memory assignment by @SunMarc in #2751
Fix duplicate environment variable check in multi-cpu condition by @yhna940 in #2752
Simplify CLI args validation and ensure CLI args take precedence over config file. by @Iain-S in #2757
Fix sagemaker config by @muellerzr in #2753
fix cpu omp num threads set by @jiqing-feng in #2755
Revert "Simplify CLI args validation and ensure CLI args take precedence over config file." by @muellerzr in #2763
Enable sharded cpu resume by @muellerzr in #2762
Sets default to PyTorch defaults based on backend by @muellerzr in #2758
optimize get_module_leaves speed by @BBuf in #2756
fix minor typo by @TemryL in #2767
Fix small edge case in get_module_leaves by @SunMarc in #2774
Skip deepspeed test by @SunMarc in #2776
Enable config for fsdp activation checkpointing by @helloworld1 in #2779
Add arg from CLI to fix failing test by @muellerzr in #2783
Skip tied weights disk offload test by @SunMarc in #2782
Introduce shard-merging util for FSDP by @muellerzr in #2772
FIX / FSDP : Guard fsdp utils for earlier PyTorch versions by @younesbelkada in #2794
Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501
Fixup CLI test by @muellerzr in #2796
fix duplicate elements in split_between_processes by @hkunzhe in #2781
Add Elastic Launch Support to notebook_launcher by @yhna940 in #2788
Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
Fix type in accelerator.py by @qgallouedec in #2800
fix comet ml test by @SunMarc in #2804
New template by @muellerzr in #2808
Fix access error for torch.mps when using torch==1.13.1 on macOS by @SunMarc in #2806
4-bit quantization meta device bias loading bug by @SunMarc in #2805
State dictionary retrieval from offloaded modules by @blbadger in #2619
add cuda dep for a test by @SunMarc in #2820
Remove out-dated xpu device check code in get_balanced_memory by @faaany in #2826
Fix DeepSpeed config validation error by changing stage3_prefetch_bucket_size value to an integer by @adk9 in #2814
Improve test speeds by up to 30% in multi-gpu settings by @muellerzr in #2830
monitor-interval, take 2 by @muellerzr in #2833
Optimize the megatron plugin by @zhangsheng377 in #2822
fix fstr format by @Jintao-Huang in #2810

New Contributors

@lyuwen made their first contribution in #2740
@MeVeryHandsome made their first contribution in #2681
@luowyang made their first contribution in #2730
@Iain-S made their first contribution in #2757
@BBuf made their first contribution in #2756
@TemryL made their first contribution in #2767
@helloworld1 made their first contribution in #2779
@hkunzhe made their first contribution in #2781
@adk9 made their first contribution in #2814
@Jintao-Huang made their first contribution in #2810

Full Changelog: v0.30.1...v0.31.0

Contributors

adk9, helloworld1, and 19 other contributors

Assets 2

10 May 17:47

muellerzr

v0.30.1

b52803d

v0.30.1: Bugfixes

Patchfix

Fix duplicate environment variable check in multi-cpu condition thanks to @yhna940 in #2752
Fix issue with missing values in the SageMaker config leading to not being able to launch in #2753
Fix CPU OMP num threads setting thanks to @jiqing-feng in #2755
Fix FSDP checkpoint unable to resume when using offloading and sharded weights due to CUDA OOM when loading the optimizer and model #2762
Fixed the problem of incorrect conditional judgment statement when configuring enable_cpu_affinity thanks to @statelesshz in #2748
Fix stacklevel in logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions thanks to @luowyang in #2730
Fix support for multiple optimizers when using LOMO thanks to @younesbelkada in #2745

Full Changelog: v0.30.0...v0.30.1

Contributors

luowyang, ji-huazhong, and 3 other contributors

Assets 2

03 May 15:29

muellerzr

v0.30.0

989cc50

v0.30.0: Advanced optimizer support, MoE DeepSpeed support, add upcasting for FSDP, and more

Core

We've simplified the tqdm wrapper to make it fully passthrough, no need to have tqdm(main_process_only, *args), it is now just tqdm(*args) and you can pass in is_main_process as a kwarg.
We've added support for advanced optimizer usage:
- Schedule free optimizer introduced by Meta by @muellerzr in #2631
- LOMO optimizer introduced by OpenLMLab by @younesbelkada in #2695
Enable BF16 autocast to everything during FP8 and enable FSDP by @muellerzr in #2655
Support dataloader send_to_device calls to use non-blocking by @drhead in #2685
allow gather_for_metrics to be more flexible by @SunMarc in #2710
Add cann version info to command accelerate env for NPU by @statelesshz in #2689
Add MLU rng state setter by @ArthurinRUC in #2664
device agnostic testing for hooks&utils&big_modeling by @statelesshz in #2602

Documentation

Through collaboration between @fabianlim (lead contribuitor), @stas00, @pacman100, and @muellerzr we have a new concept guide out for FSDP and DeepSpeed explicitly detailing how each interop and explaining fully and clearly how each of those work. This was a momumental effort by @fabianlim to ensure that everything can be as accurate as possible to users. I highly recommend visiting this new documentation, available here
New distributed inference examples have been added thanks to @SunMarc in #2672
Fixed some docs for using internal trackers by @brentyi in #2650

DeepSpeed

Accelerate can now handle MoE models when using deepspeed, thanks to @pacman100 in #2662
Allow "auto" for gradient clipping in YAML by @regisss in #2649
Introduce a deepspeed-specific Docker image by @muellerzr in #2707. To use, pull the gpu-deepspeed tag docker pull huggingface/accelerate:cuda-deepspeed-nightly

Megatron

Megatron plugin can support NPU by @zhangsheng377 in #2667

Big Modeling

Add strict arg to load_checkpoint_and_dispatch by @SunMarc in #2641

Bug Fixes

Fix up state with xla + performance regression by @muellerzr in #2634
Parenthesis on xpu_available by @muellerzr in #2639
Fix is_train_batch_min type in DeepSpeedPlugin by @yhna940 in #2646
Fix backend check by @jiqing-feng in #2652
Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
Block AMP for MPS device by @SunMarc in #2699
Fixed issue when doing multi-gpu training with bnb when the first gpu is not used by @SunMarc in #2714
Fixup free_memory to deal with garbage collection by @muellerzr in #2716
Fix sampler serialization failing by @SunMarc in #2723
Fix deepspeed offload device type in the arguments to be more accurate by @yhna940 in #2717

Full Changelog

Schedule free optimizer support by @muellerzr in #2631
Fix up state with xla + performance regression by @muellerzr in #2634
Parenthesis on xpu_available by @muellerzr in #2639
add third-party device prefix to execution_device by @faaany in #2612
add strict arg to load_checkpoint_and_dispatch by @SunMarc in #2641
device agnostic testing for hooks&utils&big_modeling by @statelesshz in #2602
Docs fix for using internal trackers by @brentyi in #2650
Allow "auto" for gradient clipping in YAML by @regisss in #2649
Fix is_train_batch_min type in DeepSpeedPlugin by @yhna940 in #2646
Don't use deprecated Repository anymore by @Wauplin in #2658
Fix test_from_pretrained_low_cpu_mem_usage_measured failure by @yuanwu2017 in #2644
Add MLU rng state setter by @ArthurinRUC in #2664
fix backend check by @jiqing-feng in #2652
Megatron plugin can support NPU by @zhangsheng377 in #2667
Revert "fix backend check" by @muellerzr in #2669
tqdm: *args should come ahead of main_process_only by @rb-synth in #2654
Handle MoE models with DeepSpeed by @pacman100 in #2662
Fix deepspeed moe test with version check by @pacman100 in #2677
Pin DS...again.. by @muellerzr in #2679
fix backend check by @jiqing-feng in #2670
Deprecate tqdm args + slight logic tweaks by @muellerzr in #2673
Enable BF16 autocast to everything during FP8 + some tweaks to enable FSDP by @muellerzr in #2655
Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
Simplify test logic by @pacman100 in #2697
Add source code for DataLoader Animation by @muellerzr in #2696
Block AMP for MPS device by @SunMarc in #2699
Do a pip freeze during workflows by @muellerzr in #2704
add cann version info to command accelerate env by @statelesshz in #2689
Add version checks for the import of DeepSpeed moe utils by @pacman100 in #2705
Change dataloader send_to_device calls to non-blocking by @drhead in #2685
add distributed examples by @SunMarc in #2672
Add diffusers to req by @muellerzr in #2711
fix bnb multi gpu training by @SunMarc in #2714
allow gather_for_metrics to be more flexible by @SunMarc in #2710
Add Upcasting for FSDP in Mixed Precision. Add Concept Guide for FSPD and DeepSpeed. by @fabianlim in #2674
Segment out a deepspeed docker image by @muellerzr in #2707
Fixup free_memory to deal with garbage collection by @muellerzr in #2716
fix sampler serialization by @SunMarc in #2723
Fix sampler failing test by @SunMarc in #2728
Docs: Fix build main documentation by @SunMarc in #2729
Fix Documentation in FSDP and DeepSpeed Concept Guide by @fabianlim in #2725
Fix deepspeed offload device type by @yhna940 in #2717
FEAT: Add LOMO optimizer by @younesbelkada in #2695
Fix tests on main by @muellerzr in #2739

New Contributors

@brentyi made their first contribution in #2650
@regisss made their first contribution in #2649
@yhna940 made their first contribution in #2646
@Wauplin made their first contribution in #2658
@ArthurinRUC made their first contribution in #2664
@jiqing-feng made their first contribution in #2652
@zhangsheng377 made their first contribution in #2667
@rb-synth made their first contribution in #2654
@drhead made their first contribution in #2685

Full Changelog: https://github.com/huggingface/acce...

Contributors

drhead, zhangsheng377, and 16 other contributors

Assets 2

17 Apr 15:46

muellerzr

v0.29.3

e82de12

v0.29.3: Patchfix

Fixes issue with backend refactor not working on CPU-based distributed environments by @jiqing-feng: #2670
Fixes issue where load_checkpoint_and_dispatch needs a strict argument
by @SunMarc: #2641

Full Changelog: v0.29.2...v0.29.3

Contributors

SunMarc and jiqing-feng

Assets 2

09 Apr 12:04

muellerzr

v0.29.2

39e0a8e

v0.29.2: Patchfix

Fixes xpu missing parenthesis #2639
Fixes XLA and performance degradation on init with the state #2634

Assets 2

05 Apr 17:09

muellerzr

v0.29.1

2107783

v0.29.1: Patchfix

Fixed an import which would cause running accelerate CLI to fail if pytest wasn't installed

Assets 2

05 Apr 14:27

muellerzr

v0.29.0

ec88c8f

v0.29.0: NUMA affinity control, MLU Support, and DeepSpeed Improvements

Core

Accelerate can now optimize NUMA affinity, which can help increase throughput on NVIDIA multi-GPU systems. To enable it either follow the prompt during accelerate config, set the ACCELERATE_CPU_AFFINITY=1 env variable, or manually using the following:

from accelerate.utils import set_numa_affinity

# For GPU 0
set_numa_affinity(0)

Big thanks to @stas00 for the recommendation, request, and feedback during development

Allow for setting deterministic algorithms in set_seed by @muellerzr in #2569
Fixed the test script for TPU v2/v3 by @vanbasten23 in #2542
Cambricon MLU device support introduced by @huismiling in #2552
A big refactor was performed to the PartialState and AcceleratorState to allow for easier future-proofing and simplification of adding new devices by @muellerzr in #2576
Fixed a reproducibility issue in distributed environments with Dataloader shuffling when using BatchSamplerShard by @universuen in #2584
notebook_launcher can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in #2561

Big Model Inference

Add log message for RTX 4000 series when performing multi-gpu inference with device_map which can lead to hanging by @SunMarc in #2557
Fix load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in #2588

DeepSpeed

Fix issue with the mapping of main_process_ip and master_addr when not using standard as deepspeed launcher by @asdfry in #2495
Improve deepspeed env gen by checking for bad keys, by @muellerzr and @ricklamers in #2565
We now support custom deepspeed env files. Like normal deepspeed, set it with the DS_ENV_FILE environmental variable by @muellerzr in #2566
Resolve ZeRO-3 Initialization Failure in already-started distributed environments by @sword865 in #2578

What's Changed

Fix test_script.py on TPU v2/v3 by @vanbasten23 in #2542
Add mapping main_process_ip and master_addr when not using standard as deepspeed launcher by @asdfry in #2495
split_between_processes for Dataset by @geronimi73 in #2433
Include working driver check by @muellerzr in #2558
🚨🚨🚨Move to using tags rather than latest for docker images and consolidate image repos 🚨 🚨🚨 by @muellerzr in #2554
Add Cambricon MLU accelerator support by @huismiling in #2552
Add NUMA affinity control for NVIDIA GPUs by @muellerzr in #2535
Add log message for RTX 4000 series when performing multi-gpu inference with device_map by @SunMarc in #2557
Improve deepspeed env gen by @muellerzr in #2565
Allow for setting deterministic algorithms by @muellerzr in #2569
Unpin deepspeed by @muellerzr in #2570
Rm uv install by @muellerzr in #2577
Allow for custom deepspeed env files by @muellerzr in #2566
[docs] Missing functions from API by @stevhliu in #2580
Update data_loader.py to Ensure Reproducibility in Multi-Process Environments with Dataloader Shuffle by @universuen in #2584
Refactor affinity and make it stateful by @muellerzr in #2579
Refactor and improve model estimator tool by @muellerzr in #2581
Fix load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in #2588
Guard stateful objects by @muellerzr in #2572
Expound PartialState docstring by @muellerzr in #2589
[docs] Fix kwarg docstring by @stevhliu in #2590
Allow notebook_launcher to launch to multiple GPUs from Colab by @StefanTodoran in #2561
Fix warning log for unused checkpoint keys by @fxmarty in #2594
Resolve ZeRO-3 Initialization Failure in Pre-Set Torch Distributed Environments (huggingface/transformers#28803) by @sword865 in #2578
Refactor PartialState and AcceleratorState by @muellerzr in #2576
Allow for force unwrapping by @muellerzr in #2595
Pin hub for tests by @muellerzr in #2608
Default false for trust_remote_code by @muellerzr in #2607
fix llama example for pippy by @SunMarc in #2616
Fix links in Quick Tour by @muellerzr in #2617
Link to bash in env reporting by @muellerzr in #2623
Unpin hub by @muellerzr in #2625

New Contributors

@asdfry made their first contribution in #2495
@geronimi73 made their first contribution in #2433
@huismiling made their first contribution in #2552
@universuen made their first contribution in #2584
@StefanTodoran made their first contribution in #2561
@sword865 made their first contribution in #2578

Full Changelog: v0.28.0...v0.29.0

Contributors

sword865, ricklamers, and 11 other contributors

Assets 2

12 Mar 16:58

muellerzr

v0.28.0

9e72c61

v0.28.0: DataLoaderConfig, XLA improvements, FSDP + QLORA foundations, Gradient Synchronization Tweaks, and Bug Fixes

Core

Introduce a DataLoaderConfiguration and begin deprecation of arguments in the Accelerator

+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)

Allow gradients to be synced each data batch while performing gradient accumulation, useful when training in FSDP by @fabianlim in #2531

from accelerate import GradientAccumulationPlugin
plugin = GradientAccumulationPlugin(
+    num_steps=2, 
    sync_each_batch=sync_each_batch
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)

Torch XLA

Support for XLA on the GPU by @anw90 in #2176
Enable gradient accumulation on TPU in #2453

FSDP

Support downstream FSDP + QLORA support through tweaks by allowing configuration of buffer precision by @pacman100 in #2544

`launch` changes

Support mpirun for multi-cpu training by @dmsuehir in #2493

What's Changed

Fix model metadata issue check by @muellerzr in #2435
Use py 3.9 by @muellerzr in #2436
Fix seedable sampler logic and expound docs by @muellerzr in #2434
Fix tied_pointers_to_remove type by @fxmarty in #2439
Make test assertions more idiomatic by @akx in #2420
Prefer is_torch_tensor over hasattr for torch.compile. by @PhilJd in #2387
Enable more Ruff lints & fix issues by @akx in #2419
Fix warning when dispatching model by @SunMarc in #2442
Make torch xla available on GPU by @anw90 in #2176
Include pippy_file_path by @muellerzr in #2444
[Big deprecation] Introduces a DataLoaderConfig by @muellerzr in #2441
Check for None by @muellerzr in #2452
Fix the pytest version to be less than 8.0.1 by @BenjaminBossan in #2461
Fix wrong is_namedtuple implementation by @fxmarty in #2475
Use grad-accum on TPU by @muellerzr in #2453
Add pre-commit configuration by @akx in #2451
Replace os.path.sep.join path manipulations with a helper by @akx in #2446
DOC: Fixes to Accelerator docstring by @BenjaminBossan in #2443
Context manager fixes by @akx in #2450
Fix TPU with new XLA device type by @will-cromar in #2467
Free mps memory by @SunMarc in #2483
[FIX] allow Accelerator to detect distributed type from the "LOCAL_RANK" env variable for XPU by @faaany in #2473
Fix CI tests due to pathlib issues by @muellerzr in #2491
Remove all cases of torchrun in tests and centralize as accelerate launch by @muellerzr in #2498
Fix link typo by @SunMarc in #2503
[docs] Accelerator API by @stevhliu in #2465
Docstring fixup by @muellerzr in #2504
[docs] Divide training and inference by @stevhliu in #2466
add custom dtype INT2 by @SunMarc in #2505
quanto compatibility for cpu/disk offload by @SunMarc in #2481
[docs] Quicktour by @stevhliu in #2456
Check if hub down by @muellerzr in #2506
Remove offline stuff by @muellerzr in #2509
Fixed 0MiB bug in convert_file_size_to_int by @StoyanStAtanasov in #2507
Fix edge case in infer_auto_device_map when dealing with buffers by @SunMarc in #2511
[docs] Fix typos by @omahs in #2490
fix typo in launch.py (----main_process_port to --main_process_port) by @DerrickWang005 in #2516
Add copyright + some ruff lint things by @muellerzr in #2523
Don't manage PYTORCH_NVML_BASED_CUDA_CHECK when calling accelerate.utils.imports.is_cuda_available() by @luiscape in #2524
Quanto compatibility with QBitsTensor by @SunMarc in #2526
Remove unnecessary env=os.environ.copy()s by @akx in #2449
Launch mpirun from accelerate launch for multi-CPU training by @dmsuehir in #2493
Enable using dash or underscore for CLI args by @muellerzr in #2527
Update the default behavior of zero_grad(set_to_none=None) to align with PyTorch by @yongchanghao in #2472
Update link to dynamo/compile doc by @WarmongeringBeaver in #2533
Check if the buffers fit GPU memory after device map auto inferred by @notsyncing in #2412
[Refactor] Refactor send_to_device to treat tensor-like first by @vmoens in #2438
Overdue email change... by @muellerzr in #2534
[docs] Troubleshoot by @stevhliu in #2538
Remove extra double-dash in error message by @drscotthawley in #2541
Allow Gradients to be Synced Each Data Batch While Performing Gradient Accumulation by @fabianlim in #2531
Update FSDP mixed precision setter to enable fsdp+qlora by @pacman100 in #2544
Use uv instead of pip install for github CI by @muellerzr in #2546

New Contributors

@anw90 made their first contribution in #2176
@StoyanStAtanasov made their first contribution in #2507
@omahs made their first contribution in #2490
@DerrickWang005 made their first contribution in #2516
@luiscape made their first contribution in #2524
@dmsuehir made their first contribution in #2493
@yongchanghao made their first contribution in #2472
@WarmongeringBeaver made their first contribution in #2533
@vmoens made their first contribution in #2438
@drscotthawley made their first contribution in #2541
@fabianlim made their first contribution in #2531

Full Changelog: v0.27.2...v0.28.0

Contributors

akx, luiscape, and 20 other contributors

Assets 2

09 Feb 16:30

muellerzr

v0.27.0

b7087be

v0.27.0: PyTorch 2.2.0 Support, PyTorch-Native Pipeline Parallism, DeepSpeed XPU support, and Bug Fixes

PyTorch 2.2.0 Support

With the latest release of PyTorch 2.2.0, we've guaranteed that there are no breaking changes regarding it

PyTorch-Native Pipeline Parallel Inference

With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so no need to use Megatron or DeepSpeed)! This supports automatic model-weight splitting to each device using a similar API to device_map="auto". This is still under heavy development, however the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.

Requires pippy of version 0.2.0 or later (pip install torchpippy -U)

Example usage (combined with accelerate launch or torchrun):

from accelerate import PartialState, prepare_pippy
model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model = prepare_pippy(model, split_points="auto", example_args=(input,))
input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)
# The outputs are only on the final process by default
# You can pass in `gather_outputs=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)

DeepSpeed

This release provides support for utilizing DeepSpeed on XPU devices thanks to @faaany

What's Changed

Convert model.hf_device_map back to Dict by @SunMarc in #2326
Fix model memory issue by @muellerzr in #2327
Fixed typos in readme files of docs folder. by @rishit5 in #2329
Disable P2P in just the 4000 series by @muellerzr in #2332
Avoid duplicating memory for tied weights in dispatch_model, and in forward with offloading by @fxmarty in #2330
Show DeepSpeed option when multi-XPU is selected in accelerate config by @faaany in #2346
FIX: add oneCCL environment variable for non-MPI launcher (accelerate launch) by @faaany in #2339
device agnostic test_accelerator/test_multigpu by @wangshuai09 in #2343
Fix mpi4py/failing deepspeed test issues by @muellerzr in #2353
Fix block_size picking in megatron_lm_gpt_pretraining example. by @nilq in #2342
Fix dispatch_model with tied weights test on T4 by @fxmarty in #2354
bugfix to allow usage of TE or MSAMP in FP8RecipeKwargs by @sudhakarsingh27 in #2355
Pin DeepSpeed until patch by @muellerzr in #2366
Remove init_hook_kwargs by @fxmarty in #2365
device agnostic optimizer testing by @statelesshz in #2363
add_hook_to_module and remove_hook_from_module compatibility with fx.GraphModule by @fxmarty in #2369
Adding requires_grad to kwargs when registering empty parameters. by @BlackSamorez in #2376
Add adapter_only option to save_fsdp_model and load_fsdp_model to only save/load PEFT weights by @AjayP13 in #2321
device agnostic cli/data_loader/grad_sync/kwargs_handlers/memory_utils testing by @wangshuai09 in #2356
Fix batch_size sanity check logic for split_batches by @izhx in #2344
Pin Torch version to <2.2.0 by @Rocketknight1 in #2394
Address PIP-632 deprecation of distutils by @AieatAssam in #2388
[don't merge yet] unpin torch by @ydshieh in #2406
Revert "[don't merge yet] unpin torch" by @muellerzr in #2407
Fix CI due to pytest by @muellerzr in #2408
Added activateEnviroment.sh to readme by @TJ-Solergibert in #2409
Fix XPU inference by @notsyncing in #2383
Fix the size of int and bool type when computing module size by @notsyncing in #2411
Adding Local SGD support for NPU by @statelesshz in #2415
Unpin torch by @muellerzr in #2418
Use Ruff for formatting too by @akx in #2400
torch-native pipeline parallelism for big models by @muellerzr in #2345
Update FSDP docs by @pacman100 in #2430
Make output end up on all GPUs at the end by @muellerzr in #2423
Migrate pippy examples over and run tests by @muellerzr in #2424
[FIX] fix the wrong nproc_per_node in the multi gpu test by @faaany in #2422
Fix fp8 things by @muellerzr in #2403
[FIX] allow Accelerator to prepare models in eval mode for XPU&CPU by @faaany in #2426
[Fix] make all tests pass on XPU by @faaany in #2427