# funasr-api
**Repository Path**: i-android/funasr-api
## Basic Information
- **Project Name**: funasr-api
- **Description**: funasr-api
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-27
- **Last Updated**: 2026-02-09
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
A ready-to-use, local speech recognition API service.
Powered by [FunASR](https://github.com/alibaba-damo-academy/FunASR) and [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR), it supports 52 languages and is compatible with the OpenAI API and the Alibaba Cloud Speech API.
[简体中文](./docs/README_zh.md)
---



## Demo
[Demo video](https://media.cdn.vect.one/qwenasr_client_demo.mp4)
## Features
- **Multi-Model Support** - Integrates [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR) 1.7B/0.6B and Paraformer Large ASR models
- **Speaker Diarization** - Automatic multi-speaker identification using CAM++ model
- **OpenAI API Compatible** - Supports `/v1/audio/transcriptions` endpoint, works with OpenAI SDK
- **Alibaba Cloud API Compatible** - Supports Alibaba Cloud Speech RESTful API and WebSocket streaming protocol
- **WebSocket Streaming** - Real-time streaming speech recognition with low latency
- **Smart Far-Field Filtering** - Automatically filters far-field sounds and ambient noise in streaming ASR
- **Intelligent Audio Segmentation** - VAD-based greedy merge algorithm for automatic long audio splitting
- **GPU Batch Processing** - Batch inference support, 2-3x faster than sequential processing
- **Flexible Configuration** - Environment variable based configuration, load models on demand
## Quick Deployment
### 1. Docker Deployment (Recommended)
```bash
# Copy and edit configuration
cp .env.example .env
# Edit .env to set ENABLED_MODELS and API_KEY (optional)
# Start service (GPU version)
docker-compose up -d
# Or CPU version
docker-compose -f docker-compose-cpu.yml up -d
```
Service URLs:
- **API Endpoint**: `http://localhost:17003`
- **API Docs**: `http://localhost:17003/docs`
**docker run (alternative):**
```bash
# GPU version
docker run -d --name funasr-api \
  --gpus all \
  -p 17003:8000 \
  -e ENABLED_MODELS=auto \
  -e API_KEY=your_api_key \
  -v ./models/modelscope:/root/.cache/modelscope \
  -v ./models/huggingface:/root/.cache/huggingface \
  quantatrisk/funasr-api:gpu-latest

# CPU version
docker run -d --name funasr-api \
  -p 17003:8000 \
  -e ENABLED_MODELS=paraformer-large \
  quantatrisk/funasr-api:cpu-latest
```
> **Note**: In a CPU-only environment, Qwen3 models are automatically filtered out (vLLM requires a GPU)

**Offline Deployment**: Package the `models/` directory and copy it to the offline machine. See [MODEL_SETUP.md](./docs/MODEL_SETUP.md) for details.
> Detailed deployment instructions: [Deployment Guide](./docs/deployment.md)
### 2. Local Development
**System Requirements:**
- Python 3.10+
- CUDA 12.1+ (optional, for GPU acceleration)
- FFmpeg (audio format conversion)
**Installation:**
```bash
# Clone the project, then enter the project directory
cd FunASR-API

# Install dependencies
pip install -r requirements.txt

# Start the service
python start.py
```
## API Endpoints
### OpenAI Compatible API
| Endpoint | Method | Function |
|----------|--------|----------|
| `/v1/audio/transcriptions` | POST | Audio transcription (OpenAI compatible) |
| `/v1/models` | GET | Model list |
**Request Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `file` | file | - | Audio file (mutually exclusive with `audio_address`) |
| `audio_address` | string | - | Audio file URL (HTTP/HTTPS, mutually exclusive with `file`) |
| `model` | string | auto-detect | Model selection (`qwen3-asr-1.7b`, `qwen3-asr-0.6b`, `paraformer-large`) |
| `language` | string | auto-detect | Language code (zh/en/ja) |
| `enable_speaker_diarization` | bool | `true` | Enable speaker diarization |
| `word_timestamps` | bool | `true` | Return word-level timestamps (Qwen3-ASR only) |
| `response_format` | string | `verbose_json` | Output format |
| `prompt` | string | - | Prompt text (reserved) |
| `temperature` | float | `0` | Sampling temperature (reserved) |
**Audio Input Methods:**
- **File Upload**: Use the `file` parameter to upload an audio file (standard OpenAI method)
- **URL Download**: Use the `audio_address` parameter to provide an audio URL; the service downloads it automatically
**Usage Examples:**
```python
# Using the OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your_api_key")

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # Maps to the default model
        file=f,
        response_format="verbose_json",  # Get segments and speaker info
    )

print(transcript.text)
```
```bash
# Using curl
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
-H "Authorization: Bearer your_api_key" \
-F "file=@audio.wav" \
-F "model=paraformer-large" \
-F "response_format=verbose_json" \
-F "enable_speaker_diarization=true"
```
**Supported Response Formats:** `json`, `text`, `srt`, `vtt`, `verbose_json`
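For subtitle output, the `srt` or `vtt` formats can be requested directly. A minimal sketch using the OpenAI SDK (file name and model choice are placeholders):

```python
# Sketch: request SRT subtitles (with speaker labels) via the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your_api_key")

with open("audio.wav", "rb") as f:
    srt_text = client.audio.transcriptions.create(
        model="paraformer-large",
        file=f,
        response_format="srt",  # also: json, text, vtt, verbose_json
    )

# str(...) keeps this working whether the SDK returns a plain string or a wrapper object
with open("audio.srt", "w", encoding="utf-8") as out:
    out.write(str(srt_text))
```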
### Alibaba Cloud Compatible API
| Endpoint | Method | Function |
|----------|--------|----------|
| `/stream/v1/asr` | POST | Speech recognition (long audio support) |
| `/stream/v1/asr/models` | GET | Model list |
| `/stream/v1/asr/health` | GET | Health check |
| `/ws/v1/asr` | WebSocket | Streaming ASR (Alibaba Cloud protocol compatible) |
| `/ws/v1/asr/funasr` | WebSocket | FunASR streaming (backward compatible) |
| `/ws/v1/asr/qwen` | WebSocket | Qwen3-ASR streaming |
| `/ws/v1/asr/test` | GET | WebSocket test page |
**Request Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_id` | string | auto-detect | Model ID |
| `audio_address` | string | - | Audio URL (optional) |
| `sample_rate` | int | `16000` | Sample rate |
| `enable_speaker_diarization` | bool | `true` | Enable speaker diarization |
| `word_timestamps` | bool | `false` | Return word-level timestamps (Qwen3-ASR only) |
| `vocabulary_id` | string | - | Hotwords (format: `word1 weight1 word2 weight2`) |
**Usage Examples:**
```bash
# Basic usage
curl -X POST "http://localhost:8000/stream/v1/asr" \
-H "Content-Type: application/octet-stream" \
--data-binary @audio.wav
# With parameters
curl -X POST "http://localhost:8000/stream/v1/asr?enable_speaker_diarization=true" \
-H "Content-Type: application/octet-stream" \
--data-binary @audio.wav
```
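The same endpoint can also be called from Python; below is a minimal sketch using the `requests` package (URL, file name, and query parameters follow the examples above, and the parsed fields follow the response example below):

```python
# Sketch: call the Alibaba Cloud-compatible REST endpoint with raw audio bytes.
import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/stream/v1/asr",
        params={"enable_speaker_diarization": "true"},
        headers={"Content-Type": "application/octet-stream"},
        data=f,       # audio bytes in the request body
        timeout=300,  # long audio can take a while to process
    )
resp.raise_for_status()

result = resp.json()
print(result["result"])
for seg in result.get("segments", []):
    print(f'{seg["speaker_id"]}: {seg["text"]} ({seg["start_time"]}-{seg["end_time"]}s)')
```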
**Response Example:**
```json
{
  "task_id": "xxx",
  "status": 200,
  "message": "SUCCESS",
  "result": "Speaker1 content...\nSpeaker2 content...",
  "duration": 60.5,
  "processing_time": 1.234,
  "segments": [
    {
      "text": "Today is a nice day.",
      "start_time": 0.0,
      "end_time": 2.5,
      "speaker_id": "Speaker1",
      "word_tokens": [
        {"text": "Today", "start_time": 0.0, "end_time": 0.5},
        {"text": "is", "start_time": 0.5, "end_time": 0.7},
        {"text": "a nice day", "start_time": 0.7, "end_time": 1.5}
      ]
    }
  ]
}
```
**WebSocket Streaming Test:** Visit `http://localhost:8000/ws/v1/asr/test`
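For a programmatic test, the sketch below streams a local PCM file over the FunASR WebSocket endpoint using the `websockets` package. It assumes the server accepts raw 16 kHz 16-bit mono PCM frames and replies with JSON text messages; check the test page above for the exact message format of your deployment.

```python
# Sketch: stream raw 16 kHz 16-bit mono PCM to the FunASR WebSocket endpoint.
# Assumptions: binary frames in, JSON text messages out (verify on /ws/v1/asr/test).
import asyncio
import json
import websockets

async def stream_pcm(path: str, chunk_ms: int = 200):
    uri = "ws://localhost:8000/ws/v1/asr/funasr"
    chunk_bytes = int(16000 * 2 * chunk_ms / 1000)  # 16 kHz * 2 bytes per sample
    async with websockets.connect(uri) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(chunk_bytes):
                await ws.send(chunk)                  # send one audio frame
                await asyncio.sleep(chunk_ms / 1000)  # pace roughly in real time
        # Drain recognition results until the server goes quiet
        try:
            while True:
                msg = await asyncio.wait_for(ws.recv(), timeout=2.0)
                print(json.loads(msg))
        except (asyncio.TimeoutError, websockets.ConnectionClosed):
            pass

asyncio.run(stream_pcm("audio.pcm"))
```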
## Speaker Diarization
Automatic multi-speaker identification based on the CAM++ model:
- **Enabled by Default** - `enable_speaker_diarization=true`
- **Automatic Detection** - No preset speaker count needed, model auto-detects
- **Speaker Labels** - Response includes `speaker_id` field (e.g., "Speaker1", "Speaker2")
- **Smart Merging** - Two-layer merge strategy to avoid isolated short segments:
  - Layer 1: accumulatively merge same-speaker segments shorter than 10 seconds
  - Layer 2: accumulatively merge continuous segments up to 60 seconds
- **Subtitle Support** - SRT/VTT output includes speaker labels `[Speaker1] text content`
Disable speaker diarization:
```bash
# OpenAI API
-F "enable_speaker_diarization=false"
# Alibaba Cloud API
?enable_speaker_diarization=false
```
## Audio Processing
### Intelligent Segmentation Strategy
Long audio is segmented automatically:
1. **VAD Voice Detection** - Detect voice boundaries and filter out silence
2. **Greedy Merge** - Accumulate voice segments so that no merged segment exceeds `MAX_SEGMENT_SEC` (default 90 s); see the sketch below
3. **Silence Split** - Force a split when the silence between voice segments exceeds 3 seconds
4. **Batch Inference** - Process multiple segments in parallel, a 2-3x speedup in GPU mode
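The greedy merge step in rough form (an illustrative sketch, not the project's actual implementation; input is the list of `(start, end)` voice segments, in seconds, produced by VAD):

```python
# Illustrative sketch of the VAD-based greedy merge described above (not the
# project's actual implementation). Input: (start, end) voice segments in seconds.
MAX_SEGMENT_SEC = 90   # upper bound per merged segment (see MAX_SEGMENT_SEC)
MAX_SILENCE_SEC = 3    # force a split when the gap between voice segments exceeds this

def greedy_merge(vad_segments):
    merged, current = [], None
    for start, end in vad_segments:
        if current is None:
            current = [start, end]
            continue
        gap = start - current[1]
        too_long = end - current[0] > MAX_SEGMENT_SEC
        if gap > MAX_SILENCE_SEC or too_long:
            merged.append(tuple(current))  # close the current segment
            current = [start, end]
        else:
            current[1] = end               # extend the current segment
    if current is not None:
        merged.append(tuple(current))
    return merged

# Example: nearby segments merge; a 10 s silence forces a split
print(greedy_merge([(0.0, 10.0), (11.0, 30.0), (40.0, 50.0)]))
# -> [(0.0, 30.0), (40.0, 50.0)]
```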
### WebSocket Streaming Limitations
**FunASR Model Limitations** (using `/ws/v1/asr` or `/ws/v1/asr/funasr`):
- ✅ Real-time speech recognition, low latency
- ✅ Sentence-level timestamps
- ❌ **Word-level timestamps** (not implemented)
- ❌ **Confidence scores** (not implemented)
**Qwen3-ASR Streaming** (using `/ws/v1/asr/qwen`):
- ✅ Word-level timestamps
- ✅ Multi-language real-time recognition
## Supported Models
| Model ID | Name | Description | Features |
|----------|------|-------------|----------|
| `qwen3-asr-1.7b` | Qwen3-ASR 1.7B | High-performance multilingual ASR, 52 languages + dialects, vLLM backend | Offline/Realtime |
| `qwen3-asr-0.6b` | Qwen3-ASR 0.6B | Lightweight multilingual ASR, suitable for low VRAM environments | Offline/Realtime |
| `paraformer-large` | Paraformer Large | High-precision Chinese speech recognition | Offline/Realtime |
**Model Selection:**
Use `ENABLED_MODELS` environment variable to control which models to load:
```bash
# Options: auto, all, or comma-separated list
ENABLED_MODELS=auto # Auto-detect GPU and load appropriate models
ENABLED_MODELS=all # Load all available models
ENABLED_MODELS=paraformer-large # Only Paraformer
ENABLED_MODELS=qwen3-asr-0.6b # Only Qwen3 0.6B
ENABLED_MODELS=paraformer-large,qwen3-asr-0.6b # Both
```
**Auto mode behavior** (see the sketch below the list):
- **VRAM >= 32GB**: Auto-load `qwen3-asr-1.7b` + `paraformer-large`
- **VRAM < 32GB**: Auto-load `qwen3-asr-0.6b` + `paraformer-large`
- **No CUDA**: Only `paraformer-large` (Qwen3 requires vLLM/GPU)
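The same decision can be reproduced manually with a quick check like the one below (a sketch assuming PyTorch is installed; the service's own detection code may differ):

```python
# Sketch: reproduce the "auto" selection described above (assumes PyTorch;
# the service's actual detection logic may differ).
import torch

def auto_models():
    if not torch.cuda.is_available():
        # No CUDA: Qwen3 needs vLLM/GPU, so only Paraformer is loaded
        return ["paraformer-large"]
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    qwen = "qwen3-asr-1.7b" if vram_gb >= 32 else "qwen3-asr-0.6b"
    return [qwen, "paraformer-large"]

print(",".join(auto_models()))  # usable as an explicit ENABLED_MODELS value
```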
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `ENABLED_MODELS` | `auto` | Models to load: `auto`, `all`, or comma-separated list |
| `API_KEY` | - | API authentication key (optional, unauthenticated if not set) |
| `LOG_LEVEL` | `INFO` | Log level (DEBUG/INFO/WARNING/ERROR) |
| `MAX_AUDIO_SIZE` | `2048` | Max audio file size (MB, supports units like 2GB) |
| `ASR_BATCH_SIZE` | `4` | ASR batch size (GPU: 4, CPU: 2) |
| `MAX_SEGMENT_SEC` | `90` | Max audio segment duration (seconds) |
| `ENABLE_STREAMING_VLLM` | `false` | Load streaming VLLM instance (saves VRAM) |
| `MODELSCOPE_PATH` | `~/.cache/modelscope/hub/models` | ModelScope cache path |
| `HF_HOME` | `~/.cache/huggingface` | HuggingFace cache path (GPU mode) |
| `ASR_ENABLE_LM` | `true` | Enable language model (Paraformer) |
| `ASR_ENABLE_NEARFIELD_FILTER` | `true` | Enable far-field sound filtering |
> Detailed configuration: [Near-Field Filter Docs](./docs/nearfield_filter.md)
## Resource Requirements
**Minimum (CPU):**
- CPU: 4 cores
- Memory: 16GB
- Disk: 20GB
**Recommended (GPU):**
- CPU: 4 cores
- Memory: 16GB
- GPU: NVIDIA GPU (16GB+ VRAM)
- Disk: 20GB
## API Documentation
After starting the service:
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`
## Links
- **Deployment Guide**: [Detailed Docs](./docs/deployment.md)
- **Near-Field Filter Config**: [Config Guide](./docs/nearfield_filter.md)
- **FunASR**: [FunASR GitHub](https://github.com/alibaba-damo-academy/FunASR)
- **Chinese README**: [中文文档](./docs/README_zh.md)
## License
This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
## Star History
[Star History Chart](https://star-history.com/#Quantatirsk/funasr-api&Date)
## Contributing
Issues and pull requests to improve the project are welcome!