# Running the LLM Checkpointing Benchmark Using DLIO

**Author:** Huihuo Zheng, Argonne National Laboratory
**Email:** huihuo.zheng@anl.gov
**Date:** February 7, 2025
## Overview

In LLM training, I/O occurs in two phases:

- **Reading the dataset for training** – As in traditional deep learning training, computation dominates, so dataset reading is comparatively light.
- **Checkpointing intermediate results** – Far more demanding because of the large size of LLM models, often involving writes on the order of terabytes. Large-scale training therefore requires storage that supports high write concurrency.
We have integrated this workload into the DLIO codebase, allowing users to benchmark checkpoint performance for different LLM models. We also plan to include this checkpointing benchmark in MLPerf Storage Benchmark v2.0.
## Running the Benchmark

Development work is available in this PR (to be merged soon): GitHub Pull Request

### Commands to Run the Benchmark

```bash
git clone -b feature/transformer git@github.com:argonne-lcf/dlio_benchmark.git
cd dlio_benchmark

# Use Python 3.10 if issues arise with Python 3.12
python3 -m pip install --upgrade pip
CC=mpicc CXX=mpicxx python3 -m pip install -r ./requirements.txt
# Remove the nvidia-dali-* and pydftracer entries from requirements.txt if they cause issues

export PYTHONPATH=$PWD:$PYTHONPATH
```
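Before scaling out, a quick single-process check can confirm that the package and its MPI bindings import cleanly and that the Hydra entry point resolves (a minimal sketch, assuming the repository root is on `PYTHONPATH` as exported above):

```bash
# Optional sanity check (assumes PYTHONPATH includes the repository root, as exported above)
python3 -c "import mpi4py, dlio_benchmark; print('DLIO imports OK')"
# Print the Hydra-generated help for the benchmark entry point
python3 dlio_benchmark/main.py --help
```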
By default:

- Checkpoint files are saved in `./checkpoints/{workload}/`
- Log files are saved in `hydra_log/{workload}/`
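The checkpoint location can be redirected to a specific filesystem with a Hydra-style override; the sketch below assumes the standard DLIO checkpoint key `workload.checkpoint.checkpoint_folder`, and the target path is a placeholder.

```bash
# Sketch: redirect checkpoint output to a chosen filesystem
# (assumes the standard DLIO config key workload.checkpoint.checkpoint_folder;
#  /path/to/fast/storage is a placeholder)
mpiexec -np 8 --ppn 8 python3 dlio_benchmark/main.py workload=llama_7b_zero3 \
    ++workload.checkpoint.checkpoint_folder=/path/to/fast/storage/checkpoints
```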
### Example Execution

```bash
mpiexec -np 128 --ppn 8 --cpu-bind list:xxxxx python3 dlio_benchmark/main.py workload=llama_70b_zero3
mpiexec -np 128 --ppn 8 --cpu-bind list:xxxxx python3 dlio_benchmark/main.py workload=llama_7b_zero3
mpiexec -np 128 --ppn 8 --cpu-bind list:xxxxx python3 dlio_benchmark/main.py workload=llama_405b
mpiexec -np 128 --ppn 8 --cpu-bind list:xxxxx python3 dlio_benchmark/main.py workload=llama_1t
```
## Benchmark Results

Results are stored in `summary.json`, which includes key metrics:
```json
{
    "num_accelerators": 1024,
    "num_hosts": 128,
    "hostname": "x1000c0s5b1n0",
    "metric": {
        "checkpoint_io_mean_GB_per_second": 132.316,
        "checkpoint_io_stdev_GB_per_second": 5.972,
        "checkpoint_duration_mean_seconds": 7.897,
        "checkpoint_duration_stdev_seconds": 0.355,
        "checkpoint_size_GB": 1042.846
    }
}
```
`per_epoch_stats.json` provides per-checkpoint statistics, including timestamps and throughput.
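As a quick consistency check on a run, the headline metrics can be pulled out of `summary.json`; the mean duration times the mean aggregate throughput should roughly equal the checkpoint size. The snippet below is a sketch; adjust the path to wherever your run placed `summary.json` under `hydra_log/{workload}/`.

```bash
# Sketch: extract and cross-check the headline metrics from a run's summary.json
# (adjust the path to your run's output under hydra_log/{workload}/)
python3 -c '
import json, sys
m = json.load(open(sys.argv[1]))["metric"]
print("mean throughput [GB/s]:", m["checkpoint_io_mean_GB_per_second"])
print("mean duration   [s]   :", m["checkpoint_duration_mean_seconds"])
print("checkpoint size [GB]  :", m["checkpoint_size_GB"])
# mean duration x mean aggregate throughput ~ checkpoint size
print("duration * throughput :", m["checkpoint_duration_mean_seconds"] * m["checkpoint_io_mean_GB_per_second"])
' hydra_log/llama_70b_zero3/summary.json
```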
## Model Configurations

| Model      | Hidden Dimension | Num Layers | FFN Size | Vocab Size | Checkpoint Size | # GPUs       |
|------------|------------------|------------|----------|------------|-----------------|--------------|
| Llama 7B   | 4096             | 32         | 11008    | 32k        | 88 GB           | 1 x 1 x 16   |
| Llama 8B   | 4096             | 32         | 11008    | 128k       | 104 GB          | 1 x 1 x 16   |
| Llama 70B  | 8192             | 80         | 28672    | 128k       | 1.1 TB          | 8 x 4 x DP?  |
| Llama 405B | 16384            | 126        | 53248    | 128k       | 6 TB            | 8 x 16 x 128 |
| Llama 1T   | 25872            | 128        | 98304    | 128k       | 17 TB           | 8 x 64 x DP? |
Pre-configured YAML files for the different transformer models (Llama 7B, 70B, 405B, 1T) can be found at: GitHub Config Files
### Example Configuration

```yaml
model:
  name: llama_70b
  type: transformer
  model_size: 30102
  num_layers: 80
  parallelism:
    tensor: 8
    pipeline: 1
    zero_stage: 3
  transformer:
    vocab_size: 128000
    hidden_size: 8192
    ffn_hidden_size: 28672
```
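To experiment beyond the shipped workloads, one option is to copy an existing workload YAML and adjust it. The sketch below assumes the standard DLIO layout, where workload YAMLs live under `dlio_benchmark/configs/workload/`; `llama_70b_custom` is a hypothetical name.

```bash
# Sketch: derive a custom workload from a shipped one
# (assumes workload YAMLs live under dlio_benchmark/configs/workload/;
#  llama_70b_custom is a hypothetical name)
cp dlio_benchmark/configs/workload/llama_70b_zero3.yaml \
   dlio_benchmark/configs/workload/llama_70b_custom.yaml
# Edit the model/parallelism settings in the new file, then run it by name:
mpiexec -np 32 --ppn 8 python3 dlio_benchmark/main.py workload=llama_70b_custom
```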
## Performance Scaling

For benchmarking:

- Fix the TP (tensor parallelism) and PP (pipeline parallelism) values.
- Increase the total number of GPUs, which must be a multiple of TP * PP; a sample sweep is sketched below.
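A minimal scaling sweep, reusing the launch command from above with 8 ranks per node (node counts are illustrative; add your system's `--cpu-bind` settings as needed):

```bash
# Scaling sweep sketch: TP and PP stay fixed by the workload YAML,
# while the total rank count grows with the number of nodes
for nodes in 16 32 64 128; do
    mpiexec -np $((nodes * 8)) --ppn 8 python3 dlio_benchmark/main.py workload=llama_70b_zero3
done
```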
### Expected Behavior

- ZeRO-3 generally delivers better checkpoint I/O than ZeRO-1, because model parameters are sharded across the data-parallel ranks in addition to optimizer states, so every rank writes its own shard concurrently.
- `llama_7b_zero3` should outperform `llama_7b` in throughput (see the comparison runs below).
- Increasing the number of data-parallel instances improves aggregate I/O throughput.
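For the second point, a back-to-back comparison can be run with the two 7B workload variants referenced above; each run's throughput is reported in its `summary.json`.

```bash
# Compare the baseline and ZeRO-3 variants of the 7B workload at the same scale
mpiexec -np 128 --ppn 8 python3 dlio_benchmark/main.py workload=llama_7b
mpiexec -np 128 --ppn 8 python3 dlio_benchmark/main.py workload=llama_7b_zero3
```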
## System Requirements

To run the benchmarks, ensure that per-host memory capacity is sufficient:

- ZeRO-3: `checkpoint_size / num_hosts`
- ZeRO-1: `model_size / pp + opt_size / num_hosts`
- Recommended: at least 256 GB of memory per host
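For example, with ZeRO-3 and the Llama 70B configuration (checkpoint size of roughly 1.1 TB, per the table above), a run spread over 128 hosts needs on the order of 1.1 TB / 128 ≈ 9 GB of free memory per host, comfortably within the 256 GB recommendation; on 16 hosts the requirement grows to about 69 GB per host.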
## Preliminary Results on an ALCF System for llama_7b

As we can see, the throughput increases as the number of MPI processes increases. In this case, 8 GPUs per node were used. The results were obtained by running the code out of the box; the setup details are intentionally omitted.