# CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation
## Updates

- **`2025/02/24`**: We have released both the 256 and 512 model weights and provided [inference scripts](#inference). Check out our [HuggingFace repo](https://huggingface.co/zhengchong/CatV2TON) for the weights.
- **`2025/01/20`**: Our paper has been published on **[ArXiv](http://arxiv.org/abs/2501.11325v1)**.

## Overview
CatV2TON is a DiT-based method for Vision-Based Virtual Try-On (V2TON) that temporally concatenates video frames with the garment condition.

## Evaluation

### Evaluation for Image Try-On

We provide an evaluation script for the VITONHD and DressCode datasets. You can download our generated [VITONHD](https://drive.google.com/file/d/1ol2M6x918lDH6bawpsiJea6DNbTNf0Cc/view?usp=share_link) and [DressCode](https://drive.google.com/file/d/1kSjofynJM13ccxn-t69z4WuqO333ahXX/view?usp=share_link) results to evaluate the performance of our method. Alternatively, you can generate your own results following the [Inference](#inference) section; they may differ slightly from ours due to randomness in the inference process.

```bash
CUDA_VISIBLE_DEVICES=0 python eval_image_metrics.py \
    --gt_folder YOUR_GT_FOLDER \
    --pred_folder YOUR_PRED_FOLDER \
    --batch_size 16 \
    --num_workers 16 \
    --paired
```

### Evaluation for Video Try-On

We provide an evaluation script for the ViViD-S-Test and VVT-Test datasets. You can download our generated [ViViD-S-Test](https://drive.google.com/file/d/1tvcDe3Z4ES6VGtpS_OI1EG155xZgjfC5/view?usp=share_link) and [VVT-Test](https://drive.google.com/file/d/1Gh8YRBsdV3BeKEXR91CNI-UU2j1fMYnt/view?usp=share_link) results to evaluate the performance of our method, or generate your own results following the [Inference](#inference) section.

```bash
CUDA_VISIBLE_DEVICES=0 python eval_video_metrics.py \
    --gt_folder YOUR_GT_FOLDER \
    --pred_folder YOUR_PRED_FOLDER \
    --num_workers 16 \
    --paired
```

`YOUR_GT_FOLDER` is the path to the ground-truth video folder and `YOUR_PRED_FOLDER` is the path to the predicted video folder; each must contain only `mp4` files.
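Since the evaluation scripts expect the ground-truth and prediction folders to each contain only `mp4` files with matching names, a quick sanity check before launching an evaluation run can catch mismatches early. The helper below is a hypothetical sketch, not part of the repository:

```python
from pathlib import Path


def check_video_folders(gt_folder, pred_folder):
    """Return (missing, extra): ground-truth videos without a same-named
    prediction, and predictions without a matching ground-truth video."""
    gt = {p.name for p in Path(gt_folder).iterdir() if p.suffix == ".mp4"}
    pred = {p.name for p in Path(pred_folder).iterdir() if p.suffix == ".mp4"}
    return sorted(gt - pred), sorted(pred - gt)
```

If both returned lists are empty, the two folders line up and the metrics script should find a prediction for every ground-truth clip.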
## Inference

### Inference for Image Try-On

We provide an inference script for the VITONHD and DressCode datasets, which can be downloaded from [VITONHD](https://github.com/shadow2496/VITON-HD) and [DressCode](https://github.com/aimagelab/dress-code). Run the following command, setting `--dataset` to either `vitonhd` or `dresscode` and adjusting the other parameters to your own setup.

```bash
CUDA_VISIBLE_DEVICES=0 python eval_image_try_on.py \
    --dataset vitonhd \
    --data_root_path YOUR_DATASET_PATH \
    --output_dir OUTPUT_DIR_TO_SAVE_RESULTS \
    --dataloader_num_workers 8 \
    --batch_size 8 \
    --seed 42 \
    --mixed_precision bf16 \
    --allow_tf32 \
    --repaint \
    --eval_pair
```

### Inference for Video Try-On

The video try-on test datasets are provided here: [ViViD-S-Test](https://drive.google.com/file/d/12QDkjn30P9EiIqZhtCFL4pEi7oZj2psQ/view?usp=share_link) and [VVT](https://drive.google.com/file/d/1mQaHP99c4CWLrVjPZEL_07OnW26z8xs2/view?usp=share_link). Run the following command, setting `--dataset` to either `vivid` or `vvt` and adjusting the other parameters to your own setup.

```bash
CUDA_VISIBLE_DEVICES=0 python eval_video_try_on.py \
    --dataset vivid \
    --data_root_path YOUR_DATASET_PATH \
    --output_dir OUTPUT_DIR_TO_SAVE_RESULTS \
    --dataloader_num_workers 8 \
    --batch_size 8 \
    --seed 42 \
    --mixed_precision bf16 \
    --allow_tf32 \
    --repaint \
    --eval_pair
```
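In the virtual try-on literature, the paired setting (enabled here via `--eval_pair`, and via `--paired` in the metrics scripts) typically reconstructs each person with their own ground-truth garment, so pixel-level metrics such as SSIM apply; the unpaired setting swaps garments across people, so only distribution-level metrics such as FID apply. The snippet below is an illustrative sketch of that difference, not code from this repository:

```python
def make_pairs(person_ids, garment_ids, paired=True):
    """Illustrative pairing logic: the paired setting matches each person
    with their own garment; the unpaired setting rotates the garment list
    so every person tries on a different person's garment."""
    if paired:
        return list(zip(person_ids, garment_ids))
    # Rotate by one so no person keeps their own garment.
    rotated = garment_ids[1:] + garment_ids[:1]
    return list(zip(person_ids, rotated))
```

The rotation shown is just one simple way to build unpaired combinations; any assignment that avoids matching a person to their own garment would serve the same purpose.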