
Running LLM Checkpointing Benchmark using DLIO

Author

Huihuo Zheng, Argonne National Laboratory
Email: huihuo.zheng@anl.gov
Date: February 7, 2025

Overview

In LLM training, I/O occurs in two phases:

  1. Reading the training dataset – similar to traditional deep learning training, where computation dominates and dataset reading is comparatively light.
  2. Checkpointing intermediate results – more demanding, because LLM checkpoints are large, often involving writes on the order of terabytes (see the back-of-envelope estimate after this list). Large-scale training therefore requires storage that supports highly concurrent writes.
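
Why terabytes? Here is a rough estimate (my sketch; the ~14 bytes per parameter assumes bf16 weights plus an fp32 master copy and two fp32 Adam moments, a common mixed-precision setup, not something mandated by DLIO):

# Back-of-envelope checkpoint size.
# Assumption: bf16 weights (2 B) + fp32 master weights (4 B)
# + two fp32 Adam moments (4 B + 4 B) = ~14 bytes per parameter.
BYTES_PER_PARAM = 2 + 4 + 4 + 4

for name, params in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9), ("1T", 1e12)]:
    size_tb = params * BYTES_PER_PARAM / 1e12
    print(f"Llama {name}: ~{size_tb:.2f} TB")  # 70B -> ~0.98 TB, 1T -> ~14 TB

These numbers are in the same ballpark as the checkpoint sizes listed in the model configuration table below.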

We have integrated this workload into the DLIO codebase, allowing users to benchmark checkpoint performance for different LLM models. We also plan to include this checkpointing benchmark in MLPerf Storage Benchmark v2.0.

Running the Benchmark

Development work is available in this PR (to be merged soon):
GitHub Pull Request

Commands to Run the Benchmark

git clone -b feature/transformer git@github.com:argonne-lcf/dlio_benchmark.git
cd dlio_benchmark

# Ensure Python 3.10 is used if issues arise with Python 3.12
python3 -m pip install --upgrade pip

CC=mpicc CXX=mpicxx python -m pip install -r ./requirements.txt
# Remove nvidia-dali-* and pydftracer from requirements.txt if they cause issues

export PYTHONPATH=$PWD:$PYTHONPATH
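
As a quick sanity check (my addition, not part of the documented setup), you can confirm that the package resolves from the repository checkout:

# Sanity check: with PYTHONPATH pointing at the repository root,
# the dlio_benchmark package should be importable.
import dlio_benchmark
print(dlio_benchmark.__file__)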

Example Execution

mpiexec -np 128 --ppn 8 --cpu-bind list:xxxxx python3 dlio_benchmark/main.py workload=llama_70b_zero3
mpiexec -np 128 --ppn 8 --cpu-bind list:xxxxx python3 dlio_benchmark/main.py workload=llama_7b_zero3
mpiexec -np 128 --ppn 8 --cpu-bind list:xxxxx python3 dlio_benchmark/main.py workload=llama_405b
mpiexec -np 128 --ppn 8 --cpu-bind list:xxxxx python3 dlio_benchmark/main.py workload=llama_1t
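
Note that list:xxxxx is a placeholder: replace it with the explicit CPU-binding list appropriate for your system.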

Benchmark Results

Results are stored in summary.json, which includes key metrics:

{
  "num_accelerators": 1024,
  "num_hosts": 128,
  "hostname": "x1000c0s5b1n0",
  "metric": {
    "checkpoint_io_mean_GB_per_second": 132.316,
    "checkpoint_io_stdev_GB_per_second": 5.972,
    "checkpoint_duration_mean_seconds": 7.897,
    "checkpoint_duration_stdev_seconds": 0.355,
    "checkpoint_size_GB": 1042.846
  }
}

per_epoch_stats.json provides per-checkpoint statistics, including timestamps and throughput.
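
Here is a minimal sketch of post-processing these metrics (the field names follow the sample summary.json above; adjust the path to your run's output directory):

import json

# Load the benchmark summary produced by DLIO.
with open("summary.json") as f:
    summary = json.load(f)

m = summary["metric"]
# Sanity check: throughput ~= size / duration
# (1042.846 GB / 7.897 s ~= 132 GB/s, matching checkpoint_io_mean_GB_per_second).
print(f'{summary["num_accelerators"]} accelerators on {summary["num_hosts"]} hosts')
print(f'checkpoint size: {m["checkpoint_size_GB"]:.1f} GB')
print(f'throughput: {m["checkpoint_io_mean_GB_per_second"]:.1f} '
      f'+/- {m["checkpoint_io_stdev_GB_per_second"]:.1f} GB/s')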

Model Configurations

Model      | Hidden Dimension | Num Layers | FFN Size | Vocab Size | Checkpoint Size | # GPUs (TP x PP x DP)
-----------|------------------|------------|----------|------------|-----------------|----------------------
Llama 7B   | 4096             | 32         | 11008    | 32k        | 88 GB           | 1 x 1 x 16
Llama 8B   | 4096             | 32         | 11008    | 128k       | 104 GB          | 1 x 1 x 16
Llama 70B  | 8192             | 80         | 28672    | 128k       | 1.1 TB          | 8 x 4 x DP?
Llama 405B | 16384            | 126        | 53248    | 128k       | 6 TB            | 8 x 16 x 128
Llama 1T   | 25872            | 128        | 98304    | 128k       | 17 TB           | 8 x 64 x DP?
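
The checkpoint sizes follow from the architecture dimensions. A rough estimator (my sketch: it ignores details such as grouped-query attention, norms, and biases, and reuses the ~14 bytes-per-parameter assumption from above):

def num_params(vocab, hidden, layers, ffn):
    # Embedding plus output head, attention projections (Q, K, V, output),
    # and a gated MLP (3 weight matrices); small terms are ignored.
    embed = 2 * vocab * hidden
    attn = 4 * hidden * hidden
    mlp = 3 * hidden * ffn
    return embed + layers * (attn + mlp)

p = num_params(vocab=32_000, hidden=4096, layers=32, ffn=11008)
print(f"Llama 7B: {p / 1e9:.2f}B params, ~{p * 14 / 1e9:.0f} GB")  # ~94 GB vs 88 GB in the table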

Pre-configured YAML files for different transformer models (Llama 7B, 70B, 405B, 1T) can be found at:
GitHub Config Files

Example Configuration

model:
  name: llama_70b
  type: transformer
  model_size: 30102
  num_layers: 80
  parallelism:
    tensor: 8
    pipeline: 1
    zero_stage: 3
transformer:
  vocab_size: 128000
  hidden_size: 8192
  ffn_hidden_size: 28672
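
Here tensor and pipeline set the tensor- and pipeline-parallel degrees, and zero_stage: 3 corresponds to ZeRO stage 3, in which parameters and optimizer state are sharded across data-parallel ranks; this affects how the checkpoint data is distributed across the writing ranks.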

Performance Scaling

For benchmarking, run the same workload across a range of node counts and compare the achieved checkpoint write throughput at each scale.

Expected Behavior

Checkpoint write throughput should increase with the number of nodes until the bandwidth of the storage system is saturated, as the preliminary results below illustrate.

System Requirements

To run the benchmarks, ensure that each node has enough host memory to hold its share of the checkpoint data; as a rule of thumb, this is roughly the total checkpoint size divided by the number of nodes.
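
For a rough per-node estimate (a sketch using the numbers from the sample summary.json above):

# Per-node host memory needed to stage checkpoint data,
# assuming the checkpoint is spread evenly across nodes.
checkpoint_size_gb = 1042.846  # llama_70b example from summary.json
num_hosts = 128
print(f"~{checkpoint_size_gb / num_hosts:.1f} GB per node")  # ~8.1 GB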

Preliminary results on an ALCF system for llama_7b

As we can see, the throughput increases as we increase the number of MPI processes; in this case, 8 GPUs per node were used. These results were obtained by running the code out of the box; I have purposely omitted the setup details.