# VideoX

**Repository Path**: wanyanhw/VideoX

## Basic Information

- **Project Name**: VideoX
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-04
- **Last Updated**: 2025-11-04

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# VideoX - Multi-modal Video Content Understanding

[Hugging Face models](https://huggingface.co/models?other=xclip) [Tweet](https://twitter.com/intent/tweet?text=A%20new%20collection%20of%20video%20cross-modal%20models.&url=https://github.com/microsoft/VideoX&via=houwen_peng&hashtags=Video,CLIP,Video_Text)

***This is a collection of our video understanding work.***

> [**SeqTrack**](./SeqTrack) (```@CVPR'23```): **SeqTrack: Sequence to Sequence Learning for Visual Object Tracking**

> [**X-CLIP**](./X-CLIP) (```@ECCV'22 Oral```): **Expanding Language-Image Pretrained Models for General Video Recognition**

> [**MS-2D-TAN**](./MS-2D-TAN) (```@TPAMI'21```): **Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language**

> [**2D-TAN**](./2D-TAN) (```@AAAI'20```): **Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language**

## News

- :sunny: Hiring research interns with strong coding skills: henry.hw.peng@gmail.com | penghouwen@icloud.com
- :boom: Apr, 2023: Code for [**SeqTrack**](./SeqTrack) is now released.
- :boom: Feb, 2023: [**SeqTrack**](./SeqTrack) was accepted to CVPR'23.
- :boom: Sep, 2022: [**X-CLIP**](./X-CLIP) is now integrated into [Hugging Face Transformers](https://huggingface.co/models?other=xclip).
- :boom: Aug, 2022: Code for [**X-CLIP**](./X-CLIP) is now released.
- :boom: Jul, 2022: [**X-CLIP**](./X-CLIP) was accepted to ECCV'22 as an Oral.
- :boom: Oct, 2021: Code for [**MS-2D-TAN**](./MS-2D-TAN) is now released.
- :boom: Sep, 2021: [**MS-2D-TAN**](./MS-2D-TAN) was accepted to TPAMI'21.
- :boom: Dec, 2019: Code for [**2D-TAN**](./2D-TAN) is now released.
- :boom: Nov, 2019: [**2D-TAN**](./2D-TAN) was accepted to AAAI'20.

## Works

### [SeqTrack](./SeqTrack)

In this paper, we propose a new sequence-to-sequence learning framework for visual tracking, dubbed SeqTrack. It casts visual tracking as a sequence generation problem, predicting object bounding boxes in an autoregressive fashion. SeqTrack adopts only a simple encoder-decoder transformer architecture: the encoder extracts visual features with a bidirectional transformer, while the causal decoder generates a sequence of bounding-box values autoregressively. The loss function is plain cross-entropy. This sequence-learning paradigm not only simplifies the tracking framework but also achieves competitive performance on many benchmarks.
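Predicting bounding boxes with cross-entropy implies that continuous coordinates are first discretized into a token vocabulary the decoder can emit. As a rough illustration of that idea (the bin count, normalization to [0, 1], and function names here are assumptions for this sketch, not values taken from the SeqTrack code):

```python
def box_to_tokens(box, num_bins=1000):
    """Quantize a normalized (x, y, w, h) box into discrete tokens.

    Each coordinate in [0, 1] maps to an integer bin index, so the
    decoder can generate a box as a short sequence of class labels
    trained with cross-entropy. num_bins is an illustrative choice.
    """
    return [min(num_bins - 1, max(0, round(v * (num_bins - 1)))) for v in box]


def tokens_to_box(tokens, num_bins=1000):
    """Invert the quantization: map bin indices back to coordinates."""
    return [t / (num_bins - 1) for t in tokens]
```

With this scheme the round-trip error per coordinate is bounded by one bin width, so a larger vocabulary trades decoder output size for localization precision.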