# Vitron
**Repository Path**: windhxs/Vitron
## Basic Information
- **Project Name**: Vitron
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-02-12
- **Last Updated**: 2025-02-12
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
**NeurIPS 2024 Paper**
[Hao Fei](http://haofei.vip/)$^{1,2}$, [Shengqiong Wu](https://chocowu.github.io/)$^{1,2}$, [Hanwang Zhang](https://personal.ntu.edu.sg/hanwangzhang/)$^{1,3}$, [Tat-Seng Chua](https://www.chuatatseng.com/)$^{2}$, [Shuicheng Yan](https://yanshuicheng.info/)$^{1}$
**▶ $^{1}$ Skywork AI, Singapore ▶ $^{2}$ National University of Singapore ▶ $^{3}$ Nanyang Technological University**

[▶ Demo Video](https://youtu.be/wiGMJzoQVu4)
## 📰 News
* **[2024.09.26]** Excited that this work has been accepted by NeurIPS 2024.
* **[2024.07.19]** We release the [Dataset](data/README.md) constructed for `Text Invocation Instruction Tuning`.
* **[2024.06.28]** 🤗 We release the checkpoints; refer to the [README](checkpoints/README.md) for more details.
* **[2024.04.04]** 👀👀👀 Our [Vitron](https://vitron-llm.github.io/) is available now! Welcome to **watch** 👀 this repository for the latest updates.
## 😮 Highlights
Existing vision LLMs might still encounter challenges such as superficial instance-level understanding, a lack of unified support for both images and videos, and insufficient coverage across various vision tasks. To fill these gaps, we present Vitron, a universal pixel-level vision LLM designed for comprehensive understanding (perceiving and reasoning), generating, segmenting (grounding and tracking), and editing (inpainting) of both static images and dynamic videos.
## 🛠️ Requirements and Installation
* Python >= 3.8
* PyTorch == 2.1.0
* CUDA Version >= 11.8
* Install required packages:
```bash
git clone https://github.com/SkyworkAI/Vitron
cd Vitron
conda create -n vitron python=3.10 -y
conda activate vitron
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
```
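After installation, a quick sanity check (a minimal sketch, assuming the steps above finished without errors) can confirm the environment matches the requirements:
```bash
# Optional: verify PyTorch and CUDA versions inside the vitron environment.
conda activate vitron
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Expected output along the lines of: 2.1.0  11.8  True
```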
🔥🔥🔥 Installation or Running Fails? 🔥🔥🔥
1. When running ffmpeg, if you see `Unknown encoder 'x264'`:
   - try re-installing ffmpeg:
```
conda uninstall ffmpeg
conda install -c conda-forge ffmpeg   # the `-c conda-forge` channel flag cannot be omitted
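ffmpeg -encoders | grep x264          # optional check: libx264 should now appear in the encoder list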
```
2. If detectron2 fails to install, try this command:
```
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
```
or refer to the official [installation guide](https://detectron2.readthedocs.io/en/latest/tutorials/install.html).
3. Error in gradio. Since `gradio>=4.0.0` introduced major breaking changes, please make sure to install the gradio version pinned in `requirements.txt`.
4. Error with deepspeed. When fine-tuning your model, this error may occur:
```
FAILED: cpu_adam.so
/usr/bin/ld: cannot find -lcurand
```
This error is caused by broken soft links created when installing deepspeed. Try the following commands to fix it:
```
cd ~/miniconda3/envs/vitron/lib
ls -al libcurand*                             # check the existing links
rm libcurand.so                               # remove the broken link
ln -s libcurand.so.10.3.5.119 libcurand.so    # create a new link (use the versioned .so file present in this directory)
```
Double-check that deepspeed now loads correctly:
```
python
>>> from deepspeed.ops.op_builder import CPUAdamBuilder
>>> ds_opt_adam = CPUAdamBuilder().load()  # if this loads successfully, deepspeed is installed correctly
```
## Code Structure
```
.
├── assets
├── checkpoints            # pre-trained checkpoints are saved here
├── data
├── examples
├── modules                # external modules used in our project
│   ├── GLIGEN
│   ├── i2vgen-xl
│   ├── SEEM
│   └── StableVideo
├── scripts
└── vitron
    ├── model
    │   ├── language_model
    │   ├── multimodal_encoder
    │   ├── multimodal_projector
    │   └── region_extractor
    └── train
```
## 👍 Deploying Gradio Demo
* First, prepare the checkpoints; see the [README](checkpoints/README.md) for more details.
* Then, you can run the demo locally via:
```
python app.py
```
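If you need to pin the demo to a specific GPU or port, generic CUDA and gradio environment variables work here as well (a minimal sketch, assuming `app.py` launches a standard gradio app; these are not Vitron-specific flags):
```bash
# Run the demo on GPU 0 and serve gradio on port 7860.
CUDA_VISIBLE_DEVICES=0 GRADIO_SERVER_PORT=7860 python app.py
```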
## Fine-tuning your model
- First, prepare the dataset.
We release the dataset constructed for `Invocation-oriented Instruction Tuning`; please refer to the [README](data/README.md) for more details.
- Then, modify the image, video, and data paths in [finetune_lora.sh](scripts/finetune_lora.sh) (a filled-in example is shown at the end of this section):
```
JSON_FOLDER=None
IMAGE_FOLDER=None
VIDEO_FOLDER=None
DATA_PATH="./data/data.json"
```
- Next, prepare the [checkpoint](checkpoints/README.md).
- Finally, run the code:
```
bash scripts/finetune_lora.sh
```
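For reference, a hypothetical filled-in configuration in [finetune_lora.sh](scripts/finetune_lora.sh) might look like the following; the paths are placeholders for illustration only and should point to wherever you stored the released annotations and media:
```bash
# Hypothetical example values -- replace with your local paths.
JSON_FOLDER="./data/json"
IMAGE_FOLDER="./data/images"
VIDEO_FOLDER="./data/videos"
DATA_PATH="./data/data.json"
```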
## 🙌 Related Projects
You may refer to the related work below, which serves as the foundation for our framework and code repository:
[Vicuna](https://github.com/lm-sys/FastChat),
[SEEM](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once),
[i2vgenxl](https://github.com/ali-vilab/VGen),
[StableVideo](https://github.com/rese1f/StableVideo), and
[Zeroscope](https://huggingface.co/cerspense/zeroscope_v2_576w).
We also partially draw inspiration from
[Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA),
and [LanguageBind](https://github.com/PKU-YuanGroup/LanguageBind).
Thanks for their wonderful works.
## 🔒 License
* The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](https://github.com/SkyworkAI/Vitron/blob/main/LICENSE) file.
* The service is a research preview intended for non-commercial use only, subject to the model [License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA, [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and [Privacy Practices](https://chrome.google.com/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb) of ShareGPT. Please contact us if you find any potential violation.
## ✏️ Citation
If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil:.
```BibTeX
@inproceedings{fei2024vitron,
  title={VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing},
  author={Fei, Hao and Wu, Shengqiong and Zhang, Hanwang and Chua, Tat-Seng and Yan, Shuicheng},
  booktitle={Proceedings of the Advances in Neural Information Processing Systems},
  year={2024},
}
```
## ✨ Star History
[Star History Chart](https://star-history.com/#SkyworkAI/Vitron&Date)