Alibaba researchers have unveiled FunAudioLLM, a framework designed to enable natural voice interaction between humans and large language models (LLMs). The system consists of two main components: SenseVoice for voice understanding and CosyVoice for voice generation.

Read the full paper here: https://arxiv.org/pdf/2407.04051

SenseVoice, available in small and large variants, excels in multilingual speech recognition, emotion recognition, and audio event detection. SenseVoice-Small offers low-latency ASR for five languages, while SenseVoice-Large supports high-precision ASR for more than 50 languages.

CosyVoice, on the other hand, specializes in multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following. It supports five languages: Chinese, English, Japanese, Cantonese, and Korean.

The integration of these models with LLMs enables a variety of applications, including speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration.
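To make the speech-to-speech translation case concrete, here is a minimal sketch of how the three stages could be chained: voice understanding, an LLM step, then voice generation. Every name in it (`Transcript`, `speech_to_speech_translate`, and the `fake_*` stand-ins) is hypothetical and chosen for illustration; none of it is the actual SenseVoice or CosyVoice API, and the stand-ins only pass text through so the flow can be run without any models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    """Hypothetical output of the voice-understanding stage."""
    text: str
    language: str
    emotion: str


def speech_to_speech_translate(
    audio: bytes,
    target_language: str,
    understand: Callable[[bytes], Transcript],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str, bytes], bytes],
) -> bytes:
    """Chain understanding -> LLM translation -> voice generation."""
    # Stage 1: recognize what was said (and in which language).
    transcript = understand(audio)
    # Stage 2: ask an LLM to translate the recognized text.
    translated = translate(transcript.text, transcript.language, target_language)
    # Stage 3: generate speech, passing the input audio as a voice prompt
    # so a cloning-capable synthesizer could keep the speaker's voice.
    return synthesize(translated, target_language, audio)


# Stand-ins so the sketch runs without real models (assumptions, not real APIs).
def fake_understand(audio: bytes) -> Transcript:
    return Transcript(text=audio.decode("utf-8"), language="en", emotion="neutral")


def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_synthesize(text: str, language: str, voice_prompt: bytes) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    result = speech_to_speech_translate(
        b"hello", "zh", fake_understand, fake_translate, fake_synthesize
    )
    print(result)
```

Passing the three stages in as plain callables keeps the sketch independent of any particular model; in a real system each stand-in would be replaced by a call into the corresponding model.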

Experimental results show that SenseVoice outperforms existing models such as Whisper on many benchmarks. For example, SenseVoice-Small is more than 5 times faster than Whisper-small and more than 15 times faster than Whisper-large on speech recognition tasks.

CosyVoice delivers high-quality speech synthesis, with content consistency and speaker similarity comparable to, or better than, the original speech.

The researchers have open-sourced the SenseVoice and CosyVoice models on ModelScope and Hugging Face, and have released the training, inference, and fine-tuning code on GitHub.

Although the system shows promising results, the researchers acknowledge some limitations. These include weaker performance on low-resource languages, the lack of streaming transcription, and the need to improve expressive emotional transitions while preserving the original voice timbre.

Alibaba has done this before: it previously released an image generator called Tongyi Wanxiang, which challenged Midjourney and DALL-E. The new release, FunAudioLLM, represents an important step in expanding its family of generative models.


