# benchy **Repository Path**: whn09/benchy ## Basic Information - **Project Name**: benchy - **Description**: Import from github - **Primary Language**: Python - **License**: BSD-3-Clause - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2022-10-19 - **Last Updated**: 2022-10-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # benchy Benchmarking Dataloader Wrapper for DL ## What is it? `benchy` is a simple tool that can capture and report throughput metrics for DL training workloads through wrapping your dataloading iterator. When used, the tool will run, measure, and report the throughput of the following in samples per second: 1. `IO`: your dataloader running in isolation 2. `SYNTHETIC`: your training workload when provided synthetic (or cached) data samples 3. `FULL`: your training worload when provided real data samples Comparing these thoughputs can help highlight what is bottlenecking your workload and help focus optimization efforts. This tool is being used for the Deep Learning at Scale tutorial at SC21 ([link](https://github.com/NERSC/sc21-dl-tutorial)). However, it could be useful in other workloads and is available here. ## How to use? The tool currently supports PyTorch Dataloader iterators and other similar Python iterators (e.g. iterators from the NVIDIA DALI library). For PyTorch dataloaders, `benchy.torch.BenchmarkDataLoader` can be used as a stand-in replacement as follows: * Using `torch.utils.data.DataLoader`: ``` from torch.utils.data import DataLoader train_loader = DataLoader(dataset, batch_size) ``` * Using `benchy.torch.BenchmarkDataLoader`: ``` from benchy.torch import BenchmarkDataLoader train_loader = BenchmarkDataLoader(dataset, batch_size) ``` For other Python iterators, you can use `benchy.torch.BenchmarkGenericIteratorWrapper` as follows: ``` train_loader = CustomDataLoader(dataset, batch_size) train_loader = benchy.torch.BenchmarkGenericIteratorWrapper(train_loader, batch_size) ``` With this in place, `benchy` will override the dataloader behavior to generate throughput numbers for the `IO`, `SYNTHETIC`, and `FULL` scenarios. At the end of a successful run with this tool, a summary will be printed to the terminal reporting the measured throughputs: ``` BENCHY::SUMMARY::IO average throughput: 8.808 +/- 0.132 BENCHY::SUMMARY:: SYNTHETIC average throughput: 19.253 +/- 0.035 BENCHY::SUMMARY::FULL average throughput: 8.465 +/- 0.211 ``` Additionally, a JSON file will be output containing measured throughput values for postprocessing/plotting. See `sample_benchy_conf.yaml` for available configuration options (e.g. number of trials to run, report frequency, etc.) and `benchy/__init__.py:_get_default_config` for defaults. To override defaults, set environment variable `BENCHY_CONFIG_FILE=` when running. Note that each trial in `benchy` acts like a full training epoch in your script. For correct behavior, set your training options to perform enough epochs to cover the number of trial and warmup trials requested in your configuration. ## Profiling Besides throughput measurements, `benchy` also has useful features for NVIDIA Nsight Systems command-line profiling (`nsys profile`): 1. Adds NVTX annotations to label training iterations, data loading time, and the duration of the different measured trials. 2. Controls if profiling is started on a single or all GPUs when used with the `--capture-range cudaProfilerApi` flag (see `profiler_mode` configuration option). This can be useful when running multi-GPU and you want to limit the profiling output.