Qwen3 ASR

Qwen3-ASR is an advanced open-source speech recognition model 1.7B parameter models. It supports both streaming and offline inference across 52 languages and dialects, including robust handling of singing and background noise. Built on the Qwen3-Omni architecture, it integrates a specialized audio encoder with a powerful language decoder to achieve industry-leading accuracy. Beyond simple transcription, it includes the Qwen3-ForcedAligner for precise, multi-level timestamping.

Features

Multimodal Chat API

Transcribe via POST /v1/chat/completions by embedding audio as a base64 data URI in the messages array. Supports WAV, FLAC, MP3, OGG, M4A, Opus, and WebM.

Docs

Whisper-Compatible Endpoint

Also available via POST /v1/audio/transcriptions as a multipart form upload — more efficient for large files, returns clean text with no parsing required.

Docs
MiniMax  M2.5
Kimi K2.5
GLM 5
DeepSeek V3.2
gpt-oss-120b
gpt-oss-20b
Qwen3 Instruct
Qwen3 Thinking
Qwen3 Coder
Qwen3.5
Qwen3 VL Instruct
Qwen3 ASR
Qwen-Image
Qwen-Image-Edit
Flux2
Stable Diffusion 3.5
Hunyuan Image
Z-Image
Wan2.2-I2V
Wan2.2-T2V
Hunyuan Image
Z-Image