# Lighthouse

![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
[![Video moment retrieval demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/awkrail/lighthouse_demo)
[![Audio moment retrieval demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/lighthouse-emnlp2024/AudioMomentRetrieval)
[![Run pytest](https://github.com/line/lighthouse/actions/workflows/pytest.yml/badge.svg)](https://github.com/line/lighthouse/actions/workflows/pytest.yml)
[![Run mypy and ruff](https://github.com/line/lighthouse/actions/workflows/mypy_ruff.yml/badge.svg)](https://github.com/line/lighthouse/actions/workflows/mypy_ruff.yml)

Lighthouse is a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). It supports seven models, four feature types (video and audio), and six datasets for reproducible MR-HD, MR, and HD. In addition, we provide an inference API and Gradio demos so that developers can easily use state-of-the-art MR-HD approaches. Lighthouse also supports [audio moment retrieval (AMR)](https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval/), a task that identifies relevant moments in an audio input given a text query.

## News
- [2026/01/18] Our work ["CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries"](https://arxiv.org/abs/2511.15131) has been accepted at ICASSP 2026.
- [2025/11/20] [Version 1.2](https://github.com/line/lighthouse/releases/tag/v1.2) has been released. It adds support for CASTELLA, a new AMR dataset introduced in ["CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries"](https://arxiv.org/abs/2511.15131).
- [2025/06/04] [Version 1.1](https://github.com/line/lighthouse/releases/tag/v1.1) has been released. It includes API changes, an AMR Gradio demo, and HuggingFace wrappers for the audio moment retrieval model and the Clotho-Moment dataset.
- [2024/12/24] Our work ["Language-based Audio Moment Retrieval"](https://arxiv.org/abs/2409.15672) has been accepted at ICASSP 2025.
- [2024/10/22] [Version 1.0](https://github.com/line/lighthouse/releases/tag/v1.0) has been released.
- [2024/10/06] Our paper has been accepted at EMNLP 2024, system demonstration track.
- [2024/09/25] Our work ["Language-based Audio Moment Retrieval"](https://arxiv.org/abs/2409.15672) has been released. Lighthouse supports AMR.
- [2024/08/22] Our demo paper is available on arXiv. Any comments are welcome: [Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection](https://www.arxiv.org/abs/2408.02901).

## Installation
Install `ffmpeg` first.
If you are an Ubuntu user, run:
```
apt install ffmpeg
```

Then install pytorch, torchvision, and torchaudio according to your GPU environment. Note that the inference API also works in CPU-only environments. We tested the code with Python 3.9 and CUDA 11.8:
```
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
```

Finally, install Lighthouse and its dependencies:
```
pip install 'git+https://github.com/line/lighthouse.git'
```

## Inference API (available for both CPU and GPU modes)
Lighthouse provides the following inference API:
```python
import torch
from lighthouse.models import CGDETRPredictor

# use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"

# slowfast_path is necessary if you use clip_slowfast features
query = 'A man is speaking in front of the camera'
model = CGDETRPredictor('/path/to/weight.ckpt', device=device,
                        feature_name='clip_slowfast', slowfast_path='SLOWFAST_8x8_R50.pkl')

# encode video features
video = model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4')

# moment retrieval & highlight detection
prediction = model.predict(query, video)
print(prediction)
"""
pred_relevant_windows: [[start, end, score], ...,]
pred_saliency_scores: [score, ...]

{'query': 'A man is speaking in front of the camera',
 'pred_relevant_windows': [[117.1296, 149.4698, 0.9993], [-0.1683, 5.4323, 0.9631], [13.3151, 23.42, 0.8129], ...],
 'pred_saliency_scores': [-10.868017196655273, -12.097496032714844, -12.483806610107422, ...]}
"""
```

Lighthouse also provides an AMR inference API:
```python
import torch
from lighthouse.models import QDDETRPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = QDDETRPredictor('/path/to/weight.ckpt', device=device, feature_name='clap')

audio = model.encode_audio('api_example/1a-ODBWMUAE.wav')
query = 'Water cascades down from a waterfall.'
prediction = model.predict(query, audio)
print(prediction)
```

Run `python api_example/demo.py` (MR-HD) or `python api_example/amr_demo.py` (AMR) to reproduce the results; these scripts download the pre-trained weights automatically. If you want to use other models, download the [pre-trained weights](https://drive.google.com/file/d/1jxs_bvwttXTF9Lk3aKLohkqfYOonLyrO/view?usp=sharing). When using `clip_slowfast` features, you also need to download the [Slowfast pre-trained weights](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_8x8_R50.pkl). When using `clip_slowfast_pann` features, download the [PANNs weights](https://zenodo.org/record/3987831/files/Cnn14_mAP%3D0.431.pth) in addition to the Slowfast weights.

**Limitation**: the maximum video duration is **150s** due to the current benchmark datasets. For CPU users, set `feature_name='clip'` because CLIP+Slowfast and CLIP+Slowfast+PANNs feature extraction is very slow without a GPU.
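For CPU-only environments, a minimal variant of the example above might look like the sketch below. It assumes that `slowfast_path` can be omitted when `feature_name='clip'` is used and that you have downloaded a checkpoint trained on CLIP features; the checkpoint path is a placeholder.
```python
from lighthouse.models import CGDETRPredictor

# Sketch of CPU-only inference with CLIP features.
# Assumption: slowfast_path is not needed when feature_name='clip'.
model = CGDETRPredictor('/path/to/clip_weight.ckpt', device='cpu', feature_name='clip')

# Encode the sample video bundled with the repository and run MR-HD for a text query.
video = model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4')
prediction = model.predict('A man is speaking in front of the camera', video)
print(prediction['pred_relevant_windows'][:3])  # top moment candidates: [start, end, score]
```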
## Gradio demo
Run `python gradio_demo/demo.py`, then upload a video, enter a text query, and click the blue button. For the AMR demo, run `python gradio_demo/amr_demo.py`.

MR-HD demo
![Gradio demo image](images/vmr_demo.png)

AMR demo
![Amr demo image](images/amr_demo.png)

## Supported models, datasets, and features
### Models
Moment retrieval & highlight detection
- [x] : [Moment-DETR (Lei et al. NeurIPS21)](https://arxiv.org/abs/2107.09609)
- [x] : [QD-DETR (Moon et al. CVPR23)](https://arxiv.org/abs/2303.13874)
- [x] : [EaTR (Jang et al. ICCV23)](https://arxiv.org/abs/2308.06947)
- [x] : [CG-DETR (Moon et al. arXiv24)](https://arxiv.org/abs/2311.08835)
- [x] : [UVCOM (Xiao et al. CVPR24)](https://arxiv.org/abs/2311.16464)
- [x] : [TR-DETR (Sun et al. AAAI24)](https://arxiv.org/abs/2401.02309)
- [x] : [TaskWeave (Jin et al. CVPR24)](https://arxiv.org/abs/2404.09263)
- [ ] : [R2-Tuning (Liu et al. ECCV24)](https://arxiv.org/abs/2404.00801)

### Datasets
Moment retrieval & highlight detection
- [x] : [QVHighlights (Lei et al. NeurIPS21)](https://arxiv.org/abs/2107.09609)
- [x] : [QVHighlights w/ Audio Features (Lei et al. NeurIPS21)](https://arxiv.org/abs/2107.09609)
- [x] : [QVHighlights ASR Pretraining (Lei et al. NeurIPS21)](https://arxiv.org/abs/2107.09609)

Moment retrieval
- [x] : [ActivityNet Captions (Krishna et al. ICCV17)](https://arxiv.org/abs/1705.00754)
- [x] : [Charades-STA (Gao et al. ICCV17)](https://arxiv.org/abs/1705.02101)
- [x] : [TaCoS (Regneri et al. TACL13)](https://aclanthology.org/Q13-1003/)

Highlight detection
- [x] : [TVSum (Song et al. CVPR15)](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Song_TVSum_Summarizing_Web_2015_CVPR_paper.pdf)
- [x] : [YouTube Highlights (Sun et al. ECCV14)](https://grail.cs.washington.edu/wp-content/uploads/2015/08/sun2014rdh.pdf)

Audio moment retrieval
- [x] : [Clotho Moment/TUT2017/UnAV100-subset (Munakata et al. arXiv24)](https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval/)

### Features
- [x] : ResNet+GloVe
- [x] : CLIP
- [x] : CLIP+Slowfast
- [x] : CLIP+Slowfast+PANNs (Audio) for QVHighlights
- [x] : I3D+CLIP (Text) for TVSum

## Reproduce the experiments
### Pre-trained weights
Pre-trained weights can be downloaded from [here](https://drive.google.com/file/d/1jxs_bvwttXTF9Lk3aKLohkqfYOonLyrO/view?usp=sharing). Download and unzip the archive in the project root directory. AMR models trained on CASTELLA and Clotho-Moment are available [here](https://zenodo.org/uploads/17422909).

### Datasets
Due to copyright issues, we distribute only the feature files. Download them and place them under the `./features` directory. To extract features from videos, we use [HERO_Video_Feature_Extractor](https://github.com/linjieli222/HERO_Video_Feature_Extractor).
- [QVHighlights](https://drive.google.com/file/d/1-ALnsXkA4csKh71sRndMwybxEDqa-dM4/view?usp=sharing)
- [Charades-STA](https://drive.google.com/file/d/1EOeP2A4IMYdotbTlTqDbv5VdvEAgQJl8/view?usp=sharing)
- [ActivityNet Captions](https://drive.google.com/file/d/1P2xS998XfbN5nSDeJLBF1m9AaVhipBva/view?usp=sharing)
- [TACoS](https://drive.google.com/file/d/1rYzme9JNAk3niH1K81wgT13pOMn005jb/view?usp=sharing)
- [TVSum](https://drive.google.com/file/d/1gSex1hpXLxHQu6zHyyQISKZjP7Ndt6U9/view?usp=sharing)
- [YouTube Highlight](https://drive.google.com/file/d/12swoymGwuN5TlDlWBTo6UUWVm2DqVBpn/view?usp=sharing)

For [AMR](https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval/), download the features from the following links:
- [Clotho Moment/TUT2017/UnAV100-subset](https://zenodo.org/records/13806234)
- [CASTELLA](https://zenodo.org/records/17412176) [[Mirror on HF]](https://huggingface.co/datasets/lighthouse-emnlp2024/CASTELLA_CLAP_features)

The whole directory should look like this:
```
lighthouse/
├── api_example
├── configs
├── data
├── features    # Download the features and place them here
│   ├── ActivityNet
│   │   ├── clip
│   │   ├── clip_text
│   │   ├── resnet
│   │   └── slowfast
│   ├── Charades
│   │   ├── clip
│   │   ├── clip_text
│   │   ├── resnet
│   │   └── slowfast
│   ├── QVHighlight
│   │   ├── clip
│   │   ├── clip_text
│   │   ├── pann
│   │   ├── resnet
│   │   └── slowfast
│   ├── tacos
│   │   ├── clip
│   │   ├── clip_text
│   │   ├── resnet
│   │   └── slowfast
│   ├── tvsum
│   │   ├── clip
│   │   ├── clip_text
│   │   ├── i3d
│   │   ├── resnet
│   │   └── slowfast
│   ├── youtube_highlight
│   │   ├── clip
│   │   ├── clip_text
│   │   └── slowfast
│   └── clotho-moments
│       ├── clap
│       └── clap_text
├── gradio_demo
├── images
├── lighthouse
├── results     # The pre-trained weights are saved in this directory
└── training
```

### Training and evaluation
#### Training
The training command is:
```
python training/train.py --model MODEL --dataset DATASET --feature FEATURE [--resume RESUME] [--domain DOMAIN]
```

|         | Options                                                                                                    |
|---------|------------------------------------------------------------------------------------------------------------|
| Model   | moment_detr, qd_detr, eatr, cg_detr, uvcom, tr_detr, taskweave_mr2hd, taskweave_hd2mr                       |
| Feature | resnet_glove, clip, clip_slowfast, clip_slowfast_pann, i3d_clip, clap                                       |
| Dataset | qvhighlight, qvhighlight_pretrain, activitynet, charades, tacos, tvsum, youtube_highlight, clotho-moment    |

(**Example 1**) Moment DETR w/ CLIP+Slowfast on QVHighlights:
```
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast
```

(**Example 2**) Moment DETR w/ CLIP+Slowfast+PANNs (Audio) on QVHighlights:
```
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast_pann
```

(**Pre-training & fine-tuning, QVHighlights only**) Lighthouse supports pre-training. Run:
```
python training/train.py --model moment_detr --dataset qvhighlight_pretrain --feature clip_slowfast
```
Then fine-tune the model with the `--resume` option:
```
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --resume results/moment_detr/qvhighlight_pretrain/clip_slowfast/best.ckpt
```

(**TVSum and YouTube Highlight**) To train models on these two datasets, you need to specify the domain:
```
python training/train.py --model moment_detr --dataset tvsum --feature clip_slowfast --domain BK
```

#### Evaluation
The evaluation command is:
```
python training/evaluate.py --model MODEL --dataset DATASET --feature FEATURE --split {val,test} --model_path MODEL_PATH --eval_path EVAL_PATH [--domain DOMAIN]
```

(**Example 1**) Evaluate Moment DETR w/ CLIP+Slowfast on the QVHighlights val set:
```
python training/evaluate.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --split val --model_path results/moment_detr/qvhighlight/clip_slowfast/best.ckpt --eval_path data/qvhighlight/highlight_val_release.jsonl
```

To generate submission files for the QVHighlights test set, change the split to `test` (**QVHighlights only**):
```
python training/evaluate.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --split test --model_path results/moment_detr/qvhighlight/clip_slowfast/best.ckpt --eval_path data/qvhighlight/highlight_test_release.jsonl
```

Then zip `hl_val_submission.jsonl` and `hl_test_submission.jsonl`, and submit the archive to [CodaLab](https://codalab.lisn.upsaclay.fr/competitions/6937) (**QVHighlights only**):
```
zip -r submission.zip val_submission.jsonl test_submission.jsonl
```

## HuggingFace Wrapper
We provide [wrappers for HuggingFace](https://huggingface.co/lighthouse-emnlp2024). You can easily use the models and datasets via `AutoModel` and `huggingface_hub`; see the sketch after the lists below. The wrapper currently provides the following models and datasets.

### Models
- [Audio Moment DETR (Munakata et al. ICASSP2025)](https://huggingface.co/lighthouse-emnlp2024/AM-DETR)

### Datasets
- [Clotho Moment (Munakata et al. ICASSP2025)](https://huggingface.co/datasets/lighthouse-emnlp2024/Clotho-Moment)
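As a rough illustration, the sketch below loads AM-DETR with `AutoModel` and fetches the Clotho-Moment files with `huggingface_hub`. The repository IDs come from the lists above, but the post-loading interface is an assumption borrowed from the in-repo AMR API; check the model card for the exact calls.
```python
from transformers import AutoModel
from huggingface_hub import snapshot_download

# Load the AM-DETR wrapper from the HuggingFace Hub.
# trust_remote_code=True is assumed to be required for the custom model class.
model = AutoModel.from_pretrained("lighthouse-emnlp2024/AM-DETR", trust_remote_code=True)

# Download the Clotho-Moment dataset files to a local cache directory.
data_dir = snapshot_download("lighthouse-emnlp2024/Clotho-Moment", repo_type="dataset")

# Hypothetical usage mirroring the in-repo AMR API (encode_audio + predict);
# the wrapped model may expose a different interface.
audio = model.encode_audio("api_example/1a-ODBWMUAE.wav")
print(model.predict("Water cascades down from a waterfall.", audio))
```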
## Citation
Lighthouse
```bibtex
@InProceedings{taichi2024emnlp,
  author    = {Taichi Nishimura and Shota Nakada and Hokuto Munakata and Tatsuya Komatsu},
  title     = {Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection},
  booktitle = {Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year      = {2024},
}
```

Audio moment retrieval
```bibtex
@InProceedings{hokuto2025icassp,
  author    = {Hokuto Munakata and Taichi Nishimura and Shota Nakada and Tatsuya Komatsu},
  title     = {Language-based Audio Moment Retrieval},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing},
  year      = {2025},
}
```

## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

## LICENSE
Apache License 2.0

## Contact
Taichi Nishimura ([taichitary@gmail.com](mailto:taichitary@gmail.com))