# VAN-Classification

**Repository Path**: kaierlong/VAN-Classification

## Basic Information

- **Project Name**: VAN-Classification
- **Description**: Visual-Attention-Network
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2022-05-26
- **Last Updated**: 2025-04-03

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Visual Attention Network (VAN)

[paper pdf](https://arxiv.org/pdf/2202.09741.pdf)

This is a PyTorch implementation of **VAN** proposed by our paper "**Visual Attention Network**".

![Comparison](./images/Comparsion.png)

Figure 1: **Comparison with different vision backbones on the ImageNet-1K validation set.**

## Citation

```
@article{guo2022visual,
  title={Visual Attention Network},
  author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min},
  journal={arXiv preprint arXiv:2202.09741},
  year={2022}
}
```

## News

### 2022.02.22
Paper released on arXiv.

### 2022.03.15
Supported by [Hugging Face](https://github.com/huggingface/transformers).

### 2022.05.01
Supported by [OpenMMLab](https://github.com/open-mmlab/mmclassification).

### Abstract

While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers (ViTs) and convolutional neural networks (CNNs) by a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.

![Decomposition](./images/decomposition.png)

Figure 2: Decomposition diagram of large-kernel convolution. A standard large-kernel convolution can be decomposed into three parts: a depth-wise convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv), and a 1×1 convolution (1×1 Conv).

![LKA](./images/LKA.png)

Figure 3: The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) the non-attention module; (c) the self-attention module; (d) a stage of our Visual Attention Network (VAN). CFF means convolutional feed-forward network. The difference between (a) and (b) is the element-wise multiplication. It is worth noting that (c) is designed for 1D sequences.
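The decomposition in Figures 2 and 3(a) amounts to only a few lines of PyTorch. The sketch below is a minimal illustration, not the reference implementation: the concrete kernel sizes (a 5×5 depth-wise convolution followed by a 7×7 depth-wise convolution with dilation 3, approximating a large 21×21 kernel) follow the decomposition described in the paper and should be treated as assumptions here.

```python
import torch
import torch.nn as nn


class LKA(nn.Module):
    """Large Kernel Attention sketch: DW-Conv -> DW-D-Conv -> 1x1 Conv,
    then element-wise multiplication with the input (Figures 2 and 3a)."""

    def __init__(self, dim):
        super().__init__()
        # Depth-wise 5x5 convolution: captures local context per channel.
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # Depth-wise 7x7 convolution with dilation 3: long-range context;
        # padding 9 keeps the spatial resolution unchanged.
        self.dw_d_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                   groups=dim, dilation=3)
        # 1x1 convolution: mixes channels (channel adaptability).
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        attn = self.dw_conv(x)
        attn = self.dw_d_conv(attn)
        attn = self.pw_conv(attn)
        # The attention map modulates the input via element-wise multiplication.
        return x * attn


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)       # (batch, channels, height, width)
    print(LKA(dim=64)(x).shape)          # torch.Size([1, 64, 56, 56])
```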
## Image Classification

### 1. Data Preparation

Download ImageNet and organize it with the following folder structure:

```
│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
```

### 2. VAN Models

| Model | #Params (M) | GFLOPs | Top-1 Acc (%) | Download |
| :-------- | :--------: | :----: | :---------: | :----------------------------------------------------------: |
| VAN-Tiny | 4.1 | 0.9 | 75.4 | [Google Drive](https://drive.google.com/file/d/1KYoIe1Zl3ZaPCwRuvnpkLyOEK04JKemu/view?usp=sharing), [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/aada2242a16245d6a561/?dl=1), [Hugging Face 🤗](https://huggingface.co/Visual-Attention-Network/VAN-Tiny-original) |
| VAN-Small | 13.9 | 2.5 | 81.1 | [Google Drive](https://drive.google.com/file/d/1LFsJHwxAs1TcXAjJ28G86_jwYwV8DzuG/view?usp=sharing), [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/dd3eb73692f74a2499c9/?dl=1), [Hugging Face 🤗](https://huggingface.co/Visual-Attention-Network/VAN-Small-original) |
| VAN-Base | 26.6 | 5.0 | 82.8 | [Google Drive](https://drive.google.com/file/d/1qApsgXCbngNYOji2UzJsfeEsPOu6dBo3/view?usp=sharing), [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/58e7acceaf334ecdba89/?dl=1), [Hugging Face 🤗](https://huggingface.co/Visual-Attention-Network/VAN-Base-original) |
| VAN-Large | 44.8 | 9.0 | 83.9 | [Google Drive](https://drive.google.com/file/d/10n6u-W3IrqiCD-7wkotejV_1XiS9kuWF/view?usp=sharing), [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/0201745f6920482490a0/?dl=1), [Hugging Face 🤗](https://huggingface.co/Visual-Attention-Network/VAN-Large-original) |
| VAN-Huge | TODO | TODO | TODO | TODO |

An unofficial [Keras (TensorFlow)](https://github.com/shkarupa-alex/tfvan) implementation is also available.

### 3. Requirements

```
1. PyTorch >= 1.7
2. timm == 0.4.12
```

### 4. Train

We use 8 GPUs for training by default. Run the following command (it is also provided in `train.sh`):

```bash
MODEL=van_tiny # van_{tiny, small, base, large}
DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.1, 0.2] for [tiny, small, base, large]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash distributed_train.sh 8 /path/to/imagenet \
  --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH
```

### 5. Validate

Run the following command (it is also provided in `eval.sh`):

```bash
MODEL=van_tiny # van_{tiny, small, base, large}
python3 validate.py /path/to/imagenet --model $MODEL \
  --checkpoint /path/to/model -b 128
```

For a quick single-image sanity check without the full validation pipeline, see the sketch at the end of this README.

## 6. Acknowledgment

Our implementation is mainly based on [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) and [PoolFormer](https://github.com/sail-sg/poolformer). Thanks to their authors.

## LICENSE

This repo is under the Apache-2.0 license. For commercial use, please contact the authors.
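## Appendix: Loading a Pretrained Checkpoint (Sketch)

The sketch below shows one possible way to restore a checkpoint downloaded from the table above into a VAN model created through timm, as a quick sanity check complementing `validate.py`. It assumes the repository's model definitions register the `van_*` architectures with timm and that the checkpoint stores its weights either directly or under a `state_dict` key; the module name `van` and the checkpoint layout are assumptions, not documented guarantees.

```python
import torch
import timm

# Assumption: the repository's model file registers the van_* architectures
# with timm (e.g. via @register_model); importing it is what makes
# timm.create_model("van_tiny") resolvable. The module name is hypothetical.
import van  # noqa: F401

# Build the architecture without pretrained weights.
model = timm.create_model("van_tiny", pretrained=False)

# Restore a checkpoint downloaded from the table above.
ckpt = torch.load("/path/to/van_tiny.pth", map_location="cpu")
# Assumption: weights are stored either directly or under a "state_dict" key.
state_dict = ckpt.get("state_dict", ckpt)
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates key-name drift
model.eval()

# Classify a single dummy image (batch, channels, height, width).
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.argmax(dim=1))
```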