# RayGen: Multi-Modal Dataset Reinforcement for MobileCLIP

**For MobileCLIP and MobileCLIP2 models and demos see the [ml-mobileclip](https://github.com/apple/ml-mobileclip) repository.**

This repository contains the data generation code used in the following papers to improve and augment multi-modal and single-modal datasets. The code can be used to run distributed inference at scale: it computes the output of an ML model on a large-scale dataset and stores it efficiently in a new dataset. The code supports processing petabytes of data and billions of samples on 10,000 GPUs and more. It also supports elasticity, where processing can start with a minimum number of workers and be scaled up gradually.

- **[MobileCLIP2: Improving Multi-Modal Reinforced Training](http://arxiv.org/abs/2508.20691). (TMLR August 2025 Featured)**
  *Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari.*
- **[MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training](https://arxiv.org/pdf/2311.17049.pdf). (CVPR 2024)**
  *Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.*
- **[Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement](https://arxiv.org/abs/2303.08983). (ICCV 2023)**
  *Fartash Faghri, Hadi Pouransari, Sachin Mehta, Mehrdad Farajtabar, Ali Farhadi, Mohammad Rastegari, Oncel Tuzel.*

The repository contains the RayGen package, which simplifies data generation using the distributed processing framework [Ray](https://github.com/ray-project/ray). The training code and models are available in the following repositories:

- [Dataset Reinforcement (ml-dr)](https://github.com/apple/ml-dr): Single-modality training on reinforced image classification datasets.
- [MobileCLIP (ml-mobileclip)](https://github.com/apple/ml-mobileclip): Multi-modal reinforced training on image-text datasets.

## Installation

For local development, we recommend installing a Miniforge-based Conda environment with Python 3.10 as the base environment. This code is developed and tested in an Ubuntu 20.04 x86_64 compute environment.

Create a Conda environment and install the environment dependencies.

```
conda create -n raygen python=3.10
conda activate raygen
```

Run the setup script:

```
bash scripts/setup_gpu_worker.sh  # or setup_cpu_worker.sh for CPU only
```

Install the `raygen` Python package into the local environment using:

```
pip install -e .
```

## Local Development

To run RayGen scripts locally on a CPU or GPU machine, first follow the installation instructions above. At least one GPU is needed for the data generation scripts, but not for dataset conversion. Additionally, install the "dev" dependencies, which include Ray.

```shell
pip install -e ".[dev]"
```

Once everything is installed, start the local Ray cluster via:

```shell
ray start --head --port=6379
```

Next, run the generation code on a toy dataset:

```shell
INPUT=s3://some_bucket/some_path/
OUTPUT=s3://some_bucket/some_path/
GPU_COUNT=$(nvidia-smi -L | wc -l)
BATCH_SIZE=256
python3 scripts/gen.py \
    --input "$INPUT" \
    --output "$OUTPUT" \
    --batch-size "$BATCH_SIZE" \
    --min-actors "$GPU_COUNT" \
    --verbose \
    --local
```

If the batch size of 256 is too large for the local GPU, reduce it. This example was run with 16 cores on the host.

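
When experimenting with different batch sizes, it can help to assemble the local invocation programmatically. The following is a minimal Python sketch that mirrors the shell command above. The helper names `count_gpus` and `build_gen_command` are hypothetical and not part of RayGen; only the flags shown in the shell command are used.

```python
import shlex
import subprocess


def count_gpus():
    """Count GPUs via `nvidia-smi -L` (one line per GPU); return 0 if unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed, so assume no usable GPU.
        return 0
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def build_gen_command(input_path, output_path, batch_size=256, min_actors=1,
                      verbose=True, local=True):
    """Build the argument list for scripts/gen.py (hypothetical helper)."""
    cmd = [
        "python3", "scripts/gen.py",
        "--input", input_path,
        "--output", output_path,
        "--batch-size", str(batch_size),
        "--min-actors", str(min_actors),
    ]
    if verbose:
        cmd.append("--verbose")
    if local:
        cmd.append("--local")
    return cmd


if __name__ == "__main__":
    # Placeholder paths copied from the shell example; a smaller batch size
    # is used here as if 256 did not fit on the local GPU.
    cmd = build_gen_command(
        "s3://some_bucket/some_path/",
        "s3://some_bucket/some_path/",
        batch_size=128,
        min_actors=count_gpus(),
    )
    # Print only; pass `cmd` to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting issues when paths contain special characters.
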
## Example Commands

The example script [run_capgen.sh](./examples/run_capgen.sh) generates synthetic captions from a CoCa model for a dataset in JSONL format and saves them as a new dataset in JSONL format.

The next example script, [run_embgen.sh](./examples/run_embgen.sh), generates CLIP image-text embeddings from an ensemble of two CLIP models. The models are loaded from HuggingFace. The number of image augmentations is set to 2. The script also specifies a regular expression that selects the subset of synthetic captions for which synthetic text embeddings are computed, as well as the image augmentation parameters.

## Supported sharded dataset formats

The code supports datasets as input and output in JSONL format (a collection of `.jsonl` files with a single `.jsonl` file as the manifest) as well as WebDataset format (a collection of `.tar` files passed with a regular expression). Additional file formats such as `jsonl.zstd`, `jsonl.gz`, and `.tfrecord` are supported as inputs but not as outputs.

To pass an input dataset, one can pass a path like `{bucket}/path/{00000000..00001023}.tar`. Alternatively, one may pass an `info.json` file that contains a list of all the shards in the dataset. The minimal content of such a file is `{"lengths": {"shard1.jsonl": 0, "shard2.jsonl": 0, …}}`. Because the generation code does not use the values in `lengths`, they can be set to zero. If the output dataset format is JSONL, a new `info.json` is written that contains additional information such as the `totalLength` of the dataset.

## Checkpointing

The code saves a checkpoint of the generation process and, by default, resumes from a checkpoint if one is available in the path. The checkpoint is a `checkpoint.json` file containing the input and output paths processed so far, together with the number of processed samples.

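
The resume behavior described above can be illustrated with a minimal Python sketch. This is not RayGen's actual implementation. The schema of `checkpoint.json` is not documented in this README, so the field names (`processed_inputs`, `outputs`, `num_samples`) and the helpers `load_checkpoint`, `save_checkpoint`, and `run` are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh empty state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    # Hypothetical schema: field names are assumptions, not RayGen's format.
    return {"processed_inputs": [], "outputs": [], "num_samples": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(shards, checkpoint_path, process_shard):
    """Process each shard once, skipping shards recorded in the checkpoint."""
    state = load_checkpoint(checkpoint_path)
    done = set(state["processed_inputs"])
    for shard in shards:
        if shard in done:
            continue  # already handled in a previous run
        output_path, sample_count = process_shard(shard)
        state["processed_inputs"].append(shard)
        state["outputs"].append(output_path)
        state["num_samples"] += sample_count
        save_checkpoint(checkpoint_path, state)
    return state


if __name__ == "__main__":
    def fake_process(shard):
        # Stand-in for model inference: pretend every shard yields 10 samples.
        return shard + ".out", 10

    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        run(["shard1.jsonl", "shard2.jsonl"], ckpt, fake_process)
        # A second run over a longer shard list only processes the new shard.
        state = run(["shard1.jsonl", "shard2.jsonl", "shard3.jsonl"], ckpt, fake_process)
        print(state["num_samples"])
```

Saving after every shard keeps the amount of repeated work after an interruption to at most one shard, at the cost of more frequent writes.
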
## License

This software and accompanying data and models have been released under the following licenses:

- Code: [Apple Sample Code License (ASCL)](./LICENSE)
- ML models: [Apple ML Research Model TOU](./LICENSE_MODELS)
- Data: [CC-BY-NC-ND](./LICENSE_DATA) ([Deed](https://creativecommons.org/licenses/by-nc-nd/4.0/))

## Acknowledgements

Our codebase is built using multiple open-source contributions. Please see [ACKNOWLEDGEMENTS](ACKNOWLEDGEMENTS) for more details.

## Citation

If you found this code useful, please cite the following papers:

    @article{faghri2025mobileclip2,
      title={Mobile{CLIP}2: Improving Multi-Modal Reinforced Training},
      author={Fartash Faghri and Pavan Kumar Anasosalu Vasu and Cem Koc and Vaishaal Shankar and Alexander T Toshev and Oncel Tuzel and Hadi Pouransari},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=WeF9zolng8},
      note={Featured Certification}
    }

    @InProceedings{vasu2024mobileclip,
      author = {Vasu, Pavan Kumar Anasosalu and Pouransari, Hadi and Faghri, Fartash and Vemulapalli, Raviteja and Tuzel, Oncel},
      title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year = {2024},
    }

    @InProceedings{faghri2023reinforce,
      author = {Faghri, Fartash and Pouransari, Hadi and Mehta, Sachin and Farajtabar, Mehrdad and Farhadi, Ali and Rastegari, Mohammad and Tuzel, Oncel},
      title = {Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      month = {October},
      year = {2023},
    }