# PyTorch DataLoaders with DALI

PyTorch DataLoaders implemented with [nvidia-dali](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html). We have implemented **CIFAR-10** and **ImageNet** dataloaders; more dataloaders will be added in the future.

With two Intel(R) Xeon(R) Gold 6154 CPUs, one Tesla V100 GPU, and the whole dataset held in a memory disk, DALI **dramatically accelerates image preprocessing**:

| Time to iterate training data (bs=256) | CIFAR-10 | ImageNet |
| :-----------------------------: | :------: | :------: |
| DALI | 1.4s (2 processors) | 625s (8 processors) |
| torchvision | 280.1s (2 processors) | 13400s (8 processors) |

For CIFAR-10 training, this reduces training time **from 1 day to 1 hour** on our hardware.

## Requirements

You only need to install the nvidia-dali package, and the version should be >= 0.12; we tested version 0.11 and it did not work.

```bash
# for CUDA 9.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/9.0 nvidia-dali
# for CUDA 10.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/10.0 nvidia-dali
```

More details and documentation can be found [here](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html#).

## Usage

You can use these dataloaders easily, as in the following example:

```python
from base import DALIDataloader
from cifar10 import HybridTrainPipe_CIFAR

# TRAIN_BS, NUM_WORKERS, IMG_DIR, CROP_SIZE and CIFAR_IMAGES_NUM_TRAIN
# are user-defined constants
pip_train = HybridTrainPipe_CIFAR(batch_size=TRAIN_BS, num_threads=NUM_WORKERS,
                                  device_id=0, data_dir=IMG_DIR, crop=CROP_SIZE,
                                  world_size=1, local_rank=0, cutout=0)
train_loader = DALIDataloader(pipeline=pip_train, size=CIFAR_IMAGES_NUM_TRAIN,
                              batch_size=TRAIN_BS, onehot_label=True)

for i, data in enumerate(train_loader):  # use it just like a PyTorch dataloader
    images = data[0].cuda(non_blocking=True)
    labels = data[1].cuda(non_blocking=True)
```

If you have enough memory to hold the dataset, we strongly recommend mounting a memory disk and putting the whole dataset in it to accelerate I/O, like this:

```bash
mount -t tmpfs -o size=20g tmpfs /userhome/memory_data
```

Note that the `20g` above is only a ceiling: mounting the tmpfs does **not** occupy 20 GB of memory immediately; memory is consumed as you put the dataset in it. Compressed files should **not** be extracted before you copy them into memory, otherwise it could be much slower.
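For example, the whole workflow might look like the sketch below: copy the still-compressed archive into the tmpfs first, then extract it there. The source path and archive name are placeholders to adapt to your setup.

```bash
# Mount a 20 GB memory disk (ceiling, not an immediate allocation)
mount -t tmpfs -o size=20g tmpfs /userhome/memory_data
# Copy the compressed archive into the tmpfs first...
cp /data/cifar-10-python.tar.gz /userhome/memory_data/
# ...then extract it inside the memory disk
tar -xzf /userhome/memory_data/cifar-10-python.tar.gz -C /userhome/memory_data/
```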
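If you also need a validation loader, the same pattern should apply. The sketch below assumes `cifar10.py` exports a `HybridValPipe_CIFAR` whose constructor mirrors the training pipeline (check the source for the exact signature); the constants are placeholder values, as in the training example above.

```python
from base import DALIDataloader
from cifar10 import HybridValPipe_CIFAR  # assumed counterpart of HybridTrainPipe_CIFAR

VAL_BS = 256                   # validation batch size
NUM_WORKERS = 4                # DALI CPU threads
IMG_DIR = '/userhome/memory_data/cifar10'  # dataset location (ideally on the tmpfs above)
CROP_SIZE = 32                 # CIFAR-10 image size
CIFAR_IMAGES_NUM_TEST = 10000  # number of images in the CIFAR-10 test set

# Arguments mirror the training example; adjust to the actual signature in cifar10.py
pip_val = HybridValPipe_CIFAR(batch_size=VAL_BS, num_threads=NUM_WORKERS,
                              device_id=0, data_dir=IMG_DIR, crop=CROP_SIZE,
                              world_size=1, local_rank=0)
val_loader = DALIDataloader(pipeline=pip_val, size=CIFAR_IMAGES_NUM_TEST,
                            batch_size=VAL_BS, onehot_label=True)

for data in val_loader:
    images = data[0].cuda(non_blocking=True)
    labels = data[1].cuda(non_blocking=True)
    # run your model in eval mode here
```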