README.md

insights/ai-battlefield.md


compute/README.md
compute/accelerator/README.md
compute/accelerator/benchmarks/README.md
compute/accelerator/nvidia/debug.md
compute/accelerator/amd/debug.md
compute/accelerator/amd/performance.md
compute/cpu/README.md
compute/cpu-memory/README.md


storage/README.md
storage/benchmarks/results/hope-2023-12-20-14-37-02-331702-summary.md


network/README.md
network/debug/README.md
network/benchmarks/README.md
network/benchmarks/results/README.md
network/benchmarks/results/disable-nvlink.md


orchestration/slurm/README.md
orchestration/slurm/admin.md
orchestration/slurm/users.md
orchestration/slurm/performance.md
orchestration/slurm/launchers/README.md


training/README.md
training/model-parallelism/README.md
training/performance/README.md
training/fault-tolerance/README.md
training/reproducibility/README.md
training/instabilities/README.md
training/instabilities/training-loss-patterns.md
training/checkpoints/README.md
training/hparams.md
training/dtype.md
training/emulate-multi-node.md
training/re-train-hub-models.md

inference/README.md

debug/README.md
debug/pytorch.md
debug/tools.md
debug/torch-distributed-hanging-solutions.md
debug/underflow_overflow.md
debug/make-tiny-models-tokenizers-datasets.md
debug/tiny-scripts/README.md

testing/README.md

resources/README.md

contributors.md

build/README.md
