# 3D-Speaker

- **Repository Path**: ruby11dog/3D-Speaker
- **License**: Apache-2.0
- **Default Branch**: main
- **Created**: 2025-08-01
- **Last Updated**: 2025-08-01

## README
3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are available on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=speaker-verification&type=audio). We also release a large-scale speech corpus, the [3D-Speaker-Dataset](https://3dspeaker.github.io/), to facilitate research on speech representation disentanglement.

Please support our community by starring the repository. Thank you all for your support!

## Benchmark

EER results for fully supervised speaker verification on the VoxCeleb, CNCeleb and 3D-Speaker datasets (lower is better; the best result in each column is in bold).

| Model | Params | VoxCeleb1-O | CNCeleb | 3D-Speaker |
|:-----:|:------:|:------:|:------:|:------:|
| Res2Net | 4.03 M | 1.56% | 7.96% | 8.03% |
| ResNet34 | 6.34 M | 1.05% | 6.92% | 7.29% |
| ECAPA-TDNN | 20.8 M | 0.86% | 8.01% | 8.87% |
| ERes2Net-base | 6.61 M | 0.84% | 6.69% | 7.21% |
| CAM++ | 7.2 M | 0.65% | 6.78% | 7.75% |
| ERes2NetV2 | 17.8 M | 0.61% | **6.14%** | 6.52% |
| ERes2Net-large | 22.46 M | **0.52%** | 6.17% | **6.34%** |

DER results for speaker diarization on public and internal multi-speaker datasets.

| Test | 3D-Speaker | [pyannote.audio](https://github.com/pyannote/pyannote-audio) | [DiariZen_WavLM](https://github.com/BUTSpeechFIT/DiariZen) |
|:-----:|:------:|:------:|:------:|
| [Aishell-4](https://arxiv.org/abs/2104.03603) | **10.30%** | 12.2% | 11.7% |
| [Alimeeting](https://www.openslr.org/119/) | 19.73% | 24.4% | **17.6%** |
| [AMI_SDM](https://groups.inf.ed.ac.uk/ami/corpus/) | 21.76% | 22.4% | **15.4%** |
| [VoxConverse](https://github.com/joonson/voxconverse) | 11.75% | **11.3%** | 28.39% |
| Meeting-CN_ZH-1 | **18.91%** | 22.37% | 32.66% |
| Meeting-CN_ZH-2 | **12.78%** | 17.86% | 18% |

## Quickstart

### Install 3D-Speaker

``` sh
git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt
```

### Running experiments

``` sh
# Each recipe below is independent; run it from the repository root.

# Speaker verification: ERes2NetV2 on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh

# Speaker verification: CAM++ on 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh

# Speaker verification: ECAPA-TDNN on 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh

# Self-supervised speaker verification: SDPN on VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh

# Audio and multimodal speaker diarization
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh

# Language identification
cd egs/3dspeaker/language-idenitfication
bash run.sh
```

### Inference using pretrained models from ModelScope

All pretrained models are released on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=speaker-verification&type=audio).
``` sh
# Install modelscope
pip install modelscope

# Choose one of the following speaker verification models:
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common

# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id
# Run batch inference
python speakerlab/bin/infer_sv_batch.py --model_id $model_id --wavs $wav_list

# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run SDPN inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id

# Run diarization inference
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir
# Enable overlap detection
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir --include_overlap --hf_access_token $hf_access_token
```

## Overview of Content

- **Supervised Speaker Verification**
  - [CAM++](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-cam%2B%2B), [ERes2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-eres2net), [ERes2NetV2](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-eres2netv2), [ECAPA-TDNN](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-ecapa), [ResNet](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-resnet) and [Res2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-res2net) training recipes on [3D-Speaker](https://3dspeaker.github.io/).
  - [CAM++](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-cam%2B%2B), [ERes2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-eres2net), [ERes2NetV2](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-eres2netv2), [ECAPA-TDNN](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-ecapa), [ResNet](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-resnet) and [Res2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-res2net) training recipes on [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/).
  - [CAM++](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-cam%2B%2B), [ERes2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-eres2net), [ERes2NetV2](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-eres2netv2), [ECAPA-TDNN](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-ecapa), [ResNet](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-resnet) and [Res2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-res2net) training recipes on [CN-Celeb](http://cnceleb.org/).
- **Self-supervised Speaker Verification**
  - [RDINO](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-rdino) and [SDPN](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-sdpn) training recipes on [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/).
  - [RDINO](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-rdino) training recipes on [3D-Speaker](https://3dspeaker.github.io/).
  - [RDINO](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-rdino) training recipes on [CN-Celeb](http://cnceleb.org/).
- **Speaker Diarization**
  - [Speaker diarization](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization) inference recipes, which comprise multiple modules: overlap detection (optional), voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering.
- **Language Identification**
  - [Language identification](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/language-identification) training recipes on [3D-Speaker](https://3dspeaker.github.io/).
- **3D-Speaker Dataset**
  - Dataset introduction and download address: [3D-Speaker](https://3dspeaker.github.io/)
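
The speaker diarization entry above lists speaker embedding extraction followed by speaker clustering. To give a feel for what that last stage does, here is a minimal sketch of one generic approach: average-linkage agglomerative clustering on cosine similarity. It is not the toolkit's implementation (this README does not say which clustering algorithm the recipe uses), and the names `cosine_similarity`, `cluster_embeddings` and the `threshold` value are hypothetical, chosen only for illustration. The toy 2-D vectors stand in for real speaker embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_embeddings(
    embeddings: List[Sequence[float]], threshold: float = 0.7
) -> List[int]:
    """Assign a speaker label to each segment embedding.

    Illustrative only: starts with one cluster per segment and repeatedly
    merges the two clusters with the highest average pairwise cosine
    similarity, stopping once no pair reaches `threshold` (an assumed value).
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Average similarity over all cross-cluster pairs.
        sims = [
            cosine_similarity(embeddings[i], embeddings[j])
            for i in c1
            for j in c2
        ]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                sim = linkage(clusters[p], clusters[q])
                if sim > best_sim:
                    best_sim, best_pair = sim, (p, q)
        if best_sim < threshold:
            break  # remaining clusters are treated as different speakers
        p, q = best_pair
        clusters[p].extend(clusters[q])
        del clusters[q]

    # Number speakers in order of first appearance.
    labels = [0] * len(embeddings)
    for speaker_id, members in enumerate(sorted(clusters, key=min)):
        for idx in members:
            labels[idx] = speaker_id
    return labels


if __name__ == "__main__":
    # Toy embeddings: segments 0, 1, 4 point one way; segments 2, 3 another.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
    print(cluster_embeddings(toy))  # [0, 0, 1, 1, 0]
```

The quadratic pair search keeps the sketch short; it is only practical for a small number of segments, and the stopping threshold would need tuning on real embeddings.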