# Shot2Story
**Repository Path**: ByteDance/Shot2Story
## Basic Information
- **Project Name**: Shot2Story
- **Description**: A new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-12-07
- **Last Updated**: 2026-03-12
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
We are excited to release a new video-text benchmark and extendable codes for multi-shot video understanding. Our updated **134k** version of dataset includes detailed long summaries for **134k videos and shot captions for 188k video shots**.
**Stay tuned for more exciting data release and new features!**
---
## What's new 👀
🌟🚀 Our paper is accepted by ICLR 2025. See you in Singapore.
🌟 Update (25/09/2024): Our online demo is back in service. If the demo fails again, please feel free to report the issue here or on Hugging Face.
🌟 Update (10/06/2024): Please check the release question-answering benchmark [here](https://huggingface.co/datasets/mhan/shot2story/viewer/question-answering). It is designed to benchmark models for multi-shot understanding, w.r.t. temporal-related, holistic-understanding and audio-related aspects.
🌟 Update (05/06/2024): Please check the cached multi-shot videos on [OneDrive](https://1drv.ms/f/s!Ap3OKt6-X52NgXoG4-64N9WZDenS?e=oIHfkZ) or [HF](https://huggingface.co/mhan/shot2story-videos). It takes around 160GB of disk space and needs to extract video shots on your own.
🌟 Update (29/04/2024): Please check the issue [here](https://github.com/bytedance/Shot2Story/issues/5) for 134k-version video download assistance. Thanks for the support from the community.
🌟 Update (24/04/2024): We release a new [134K version](https://huggingface.co/datasets/mhan/shot2story).
- It has detailed video text summaries by (43K) human annotation and (90K) GPTV generation, covering over 548k video shots.
- Val/Test split in different tasks are remained same to 20K version. [Online ChatBot ](https://huggingface.co/spaces/mhan/Shot2Story) has been updated. 🎥📝🚀
- Video textual summary generation demo ([SumBot](https://af4cf4e4ce320a8d6c.gradio.live)) is also online. Have a try to generate detailed description for your video! 🎥📝
🌟 Update (23/04/2024): Please check the issue [here](https://github.com/bytedance/Shot2Story/issues/5) for 20k-version video download assistance. Thanks for the support from the community.
🌟 Update (16/12/2023): [Paper](https://arxiv.org/pdf/2312.10300) and [Demo](https://huggingface.co/spaces/mhan/Shot2Story) for SUM-shot model. It showcases the power and versatility of detailed and grounded video summaries. Dive into the demo and share your experiences with us! **Chat-SUM-shot** is on the way! Stay tuned!🎥📝🚀
🌟 Update (12/12/2023): [Code](https://github.com/bytedance/Shot2Story/tree/master/code) for video summarization and shot captioning, in the sub-directory `code` of this repo. Dive into these new features and share your experiences with us! 🎥📝🚀
🌟 Update (30/11/2023): [Data](https://github.com/bytedance/Shot2Story/blob/master/DATA.md) of [Shot2Story-20K](https://huggingface.co/datasets/mhan/Shot2Story-20K). Check them out and stay tuned for more exciting updates! 💫🚀
---
## Demo
We build a [ChatBot](https://huggingface.co/spaces/mhan/Shot2Story) demo and a [SUMBot](https://huggingface.co/spaces/mhan/Shot2Story-SUM) demo for SUM-shot model. Please have a look and explore what it is capable of. Issues are welcomed!
Some hints to play with our demo:
* 🎉 Start with our provided **demo videos**, some of which are sampled from ActivityNet, not included in our training data.
* 🚀 Please upload videos **less than 20MB**. Enjoy!
* 😄 For a more comprehensive understanding, try specifying reasonable starting and ending timestamps for the shots. Enjoy!
* 😄 Setting temperature to 0.1 for the most grounded understanding and question-answering.
* 😄 Setting temperature to greater value for the creative grounded understanding and question-answering.
Multi-round conversation analyzing a humorous video:
https://github.com/bytedance/Shot2Story/assets/18671115/3f86cf2b-ff59-47e6-8d7b-c1273c8ea5bc
Multiple-step minutes-long video analysis:
https://github.com/bytedance/Shot2Story/assets/18671115/42b5c81d-2a49-4668-a2fd-08fe26aeb9fd
---
## Table of Contents
1. [🌟 What's new 👀](#news)
2. [Demo](#demo)
3. [Introduction](#introduction)
4. [Dataset Glance](#dataset-glance)
5. [Baselines and Tasks](#baselines)
6. [License](#license)
7. [Citation](#citation)
8. [Contact](#contact)
---
## Introduction
A short clip of video may contain progression of multiple events and an interesting story line. A human needs to capture both the event in every shot and associate them together to understand the story behind it. In this work, we present a new multi-shot video understanding benchmark Shot2Story with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show some challenges to generate a long and comprehensive video summary.
--- ## Baselines and Tasks To benchmark the advances of multi-modal video understanding, we designed several distinctive tasks using our dataset, including single-shot captioning, multi-shot summarization, and video retrieval with shot description. We design and implemented several baseline models using a frozen vision encoder and an LLM, by prompting the LLM with frame tokens and ASR (Automatic Speech Recognition) text. [Code](./code/README.md) here for running the project.
--- ## License Our code is licensed under a [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.txt). Our text annotations are released under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License](https://creativecommons.org/licenses/by-nc-sa/4.0/). They are available strictly for non-commercial research. More guidelines of dataset can be found in [here](./DATA.md#license). --- ## Citation If you find this repo useful for your research, please consider citing the paper ``` @article{han2023shot2story20k, title={Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos}, author={Mingfei Han and Linjie Yang and Xiaojun Chang and Heng Wang}, journal={arXiv preprint arXiv:2311.17043}, year={2023} } ``` --- ## Contact If you have any questions or concerns about our dataset, please don't hesitate to contact us. You can raise an issue or reach us at hanmingfei@bytedance.com. We welcome feedback and are always looking to improve our dataset. --- We extend our thanks to the teams behind [HD-VILA-100M](https://github.com/microsoft/XPretrain/blob/main/hd-vila-100m/README.md), [BLIP2](https://github.com/salesforce/LAVIS), [Whisper](https://github.com/openai/whisper), [MiniGPT-4](https://minigpt-4.github.io/), [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) and [LLaMA](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/). Our work builds upon their valuable contributions. Please acknowledge these resources in your work.