LongCat-Audio-Codec
Audio tokenizer and detokenizer for speech large language models
Overview
LongCat-Audio-Codec is an audio tokenizer and detokenizer that provides low-bitrate, real-time streaming for speech LLMs. The tokenizer converts raw audio into parallel semantic and acoustic token sequences, and the detokenizer reconstructs high-fidelity audio from those tokens at bitrates of 0.43–0.87 kbps with low latency.
Key Features
- Parallel token extraction: Generates semantic and acoustic tokens simultaneously via cascade training and parallel inference
- Low-bitrate: 0.43–0.87 kbps with flexible acoustic codebook configurations
- Low-latency streaming: Frame-level incremental processing with ~100 ms decoding latency for real-time applications
- Super-resolution: Upsampling in the detokenizer for enhanced output quality (16 kHz and 24 kHz variants)
- High fidelity: At 0.87 kbps (4 codebooks), WER 1.48, PESQ 2.30, STOI 0.921, speaker similarity 0.942
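
The bitrate figures above follow from the number of token streams: bitrate is the frame rate multiplied by the bits carried per frame across all codebooks. The sketch below illustrates that arithmetic. The frame rate (about 16.67 frames per second) and the codebook size (8192 entries, 13 bits per token) are assumptions, not values stated in this document; they are chosen because they reproduce the quoted 0.43 and 0.87 kbps figures for 2 and 4 codebooks. The function name `bitrate_kbps` is hypothetical.

```python
import math


def bitrate_kbps(frame_rate_hz, codebook_sizes):
    """Bitrate of a multi-codebook token stream, in kbps.

    Each frame emits one token per codebook, and a token drawn from a
    codebook with N entries carries log2(N) bits.
    """
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame / 1000.0


# Assumptions (not stated in this document):
FRAME_RATE_HZ = 50 / 3   # about 16.67 token frames per second
CODEBOOK_SIZE = 8192     # 13 bits per token

two_codebooks = bitrate_kbps(FRAME_RATE_HZ, [CODEBOOK_SIZE] * 2)
four_codebooks = bitrate_kbps(FRAME_RATE_HZ, [CODEBOOK_SIZE] * 4)

print(f"2 codebooks: {two_codebooks:.2f} kbps")
print(f"4 codebooks: {four_codebooks:.2f} kbps")
```

Under these assumptions, each additional acoustic codebook adds roughly 0.22 kbps, which is the trade-off behind the flexible codebook configurations.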
Use Cases
LongCat-Audio-Codec is designed for speech large language models (Speech LLMs), where it provides efficient audio encoding and decoding. It integrates with the LongCat-Flash-Omni pipeline for real-time multimodal interaction and is well suited to voice assistants, streaming ASR, and real-time dialogue systems.
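
For streaming use cases like these, frame-level incremental processing means audio is consumed in fixed-size frames as it arrives, rather than after the full utterance is available. The following is a minimal sketch of the buffering step only. It does not use the LongCat-Audio-Codec API; `iter_frames` is a hypothetical helper, and the frame size (960 samples, i.e. 60 ms at 16 kHz) is an assumption for illustration.

```python
from typing import Iterable, Iterator, List


def iter_frames(chunks: Iterable[List[float]], frame_size: int) -> Iterator[List[float]]:
    """Regroup arbitrarily sized audio chunks into fixed-size frames.

    Samples that do not yet fill a frame stay buffered until more audio
    arrives; any leftover at the end of the stream is not emitted.
    """
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_size:
            yield buffer[:frame_size]
            del buffer[:frame_size]


# Hypothetical frame size: 60 ms of audio at a 16 kHz sample rate.
FRAME_SIZE = 960

# Simulated network or microphone chunks of uneven length (2000 samples total).
incoming = [[0.0] * 400, [0.0] * 700, [0.0] * 900]

frames = list(iter_frames(incoming, FRAME_SIZE))
print(f"emitted {len(frames)} frames of {FRAME_SIZE} samples")
```

In a real pipeline, each yielded frame would be passed to the tokenizer as soon as it is complete, so end-to-end latency is bounded by the frame duration plus model compute rather than by utterance length.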