Sanity Checks

When configuring the Slurm script, make sure that each of the following constraints holds exactly:

players:

  • NHIDDEN
  • NHEADS
NHIDDEN % NHEADS == 0

players:

  • GLOBAL_BATCH_SIZE
  • MICRO_BATCH_SIZE
  • DP_SIZE
GLOBAL_BATCH_SIZE % (MICRO_BATCH_SIZE * DP_SIZE) == 0

players:

  • NLAYERS
  • PP_SIZE
NLAYERS % PP_SIZE == 0
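
The three divisibility constraints above can be checked before a job is submitted. The following is a minimal sketch, not part of the original scripts: the function name `check_parallel_config` and its lowercase parameter names are hypothetical and simply mirror the variables listed above.

```python
def check_parallel_config(nhidden, nheads, global_batch_size,
                          micro_batch_size, dp_size, nlayers, pp_size):
    """Return a list of violated constraints (empty if all of them hold).

    Hypothetical helper: the parameters mirror the Slurm script variables
    NHIDDEN, NHEADS, GLOBAL_BATCH_SIZE, MICRO_BATCH_SIZE, DP_SIZE,
    NLAYERS and PP_SIZE.
    """
    errors = []
    # The hidden size must split evenly across attention heads.
    if nhidden % nheads != 0:
        errors.append("NHIDDEN % NHEADS != 0")
    # The global batch must be a whole number of (micro batch * DP replicas).
    if global_batch_size % (micro_batch_size * dp_size) != 0:
        errors.append("GLOBAL_BATCH_SIZE % (MICRO_BATCH_SIZE * DP_SIZE) != 0")
    # The layers must split evenly across pipeline stages.
    if nlayers % pp_size != 0:
        errors.append("NLAYERS % PP_SIZE != 0")
    return errors
```

Calling it with the values from the Slurm script and aborting when the returned list is non-empty catches a bad configuration before any nodes are allocated.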
  1. Curriculum Learning Constraints

  • min_difficulty % 8 == 0 (to enable Tensor Core acceleration)

  • The JSON ds config can't contain numbers with '_' in them (for example 1_000), because that is invalid JSON. Be careful with substitutions.
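
Both curriculum learning constraints can be verified with a few lines of Python. This is a sketch under assumptions: the helper names `check_min_difficulty` and `load_ds_config` are hypothetical, and the config is assumed to be available as a string after all substitutions have been applied.

```python
import json


def check_min_difficulty(min_difficulty):
    """True if min_difficulty is a multiple of 8."""
    return min_difficulty % 8 == 0


def load_ds_config(text):
    """Parse the substituted ds config text, failing loudly on invalid JSON.

    A number such as 2_048 is valid in Python or shell arithmetic but not
    in JSON, so json.loads rejects it.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"invalid JSON in ds config at line {err.lineno}, "
            f"column {err.colno}: {err.msg}"
        ) from err
```

Parsing the final, substituted text rather than the template is the point here: the underscore usually sneaks in through a substituted variable, not through the template itself.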

Restarting from an existing checkpoint: constraints

XXX: quite a few of these - need to start collecting them all

  • can't change TP size (but changing PP size is OK)

  • can't change max-lr, or you will get:

AnnealingLR: class input value 1e-05 and checkpointvalue 3e-05 for learning rate do not match
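
One way to catch such mismatches before launching is to compare the settings recorded for the checkpoint with those of the new run. The sketch below is illustrative only: `check_restart_args`, the dictionary layout and the key names `tp_size` and `max_lr` are all hypothetical and do not correspond to any real checkpoint format.

```python
def check_restart_args(saved, new, frozen_keys=("tp_size", "max_lr")):
    """List settings that differ between the checkpoint and the new run.

    Only keys in frozen_keys are compared; anything else (for example the
    PP size) is allowed to change. All key names are hypothetical.
    """
    return [
        f"{key}: checkpoint has {saved.get(key)!r}, new run has {new.get(key)!r}"
        for key in frozen_keys
        if saved.get(key) != new.get(key)
    ]


# Example: PP size changes (allowed), max LR changes (not allowed).
saved_args = {"tp_size": 4, "pp_size": 8, "max_lr": 3e-05}
new_args = {"tp_size": 4, "pp_size": 4, "max_lr": 1e-05}
problems = check_restart_args(saved_args, new_args)
```

As more restart constraints are collected, they can be added to `frozen_keys` without changing the rest of the check.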