Qwen3 ASR
Qwen3-ASR is an advanced open-source speech recognition model 1.7B parameter models. It supports both streaming and offline inference across 52 languages and dialects, including robust handling of singing and background noise. Built on the Qwen3-Omni architecture, it integrates a specialized audio encoder with a powerful language decoder to achieve industry-leading accuracy. Beyond simple transcription, it includes the Qwen3-ForcedAligner for precise, multi-level timestamping.
Features
Multimodal Chat API
Transcribe via POST /v1/chat/completions by embedding audio as a base64 data URI in the messages array. Supports WAV, FLAC, MP3, OGG, M4A, Opus, and WebM.
DocsWhisper-Compatible Endpoint
Also available via POST /v1/audio/transcriptions as a multipart form upload — more efficient for large files, returns clean text with no parsing required.
Docs