Speech AI is used in a variety of applications, including call centers for empowering human agents, speech interfaces for virtual assistants, and live captioning in video conferencing. Speech AI includes automatic speech recognition (ASR) and text-to-speech (TTS). The ASR pipeline takes raw audio and converts it to text, and the TTS pipeline takes text and converts it to audio.

Developing and running these real-time speech AI services is a complex and difficult task. Building speech AI applications requires hundreds of thousands of hours of audio data, tools to build and customize models for your specific use case, and scalable deployment support. It also means running in real time, with latency far under 300 milliseconds (ms), to interact naturally with users. NVIDIA Riva streamlines the end-to-end process of developing speech AI services and provides real-time performance for human-like interactions.

NVIDIA Riva is a GPU-accelerated SDK for developing speech AI applications. Riva is designed to help you access conversational AI functionality easily and quickly. With a few commands, you can access the high-performance services through API operations and try demos.

Figure 1. Riva workflow for building speech applications

The Riva SDK includes pretrained speech and language models, the NVIDIA TAO Toolkit for fine-tuning these models on a custom dataset, and optimized end-to-end skills for speech recognition, language understanding, and speech synthesis. Using Riva, you can easily fine-tune state-of-the-art models on your data to achieve a deeper understanding of their specific contexts. Optimized for inference, Riva offers real-time services that run in under 150 ms, compared to the 25 seconds required on CPU-only platforms.

Task-specific AI services and gRPC endpoints provide out-of-the-box, high-performance ASR, NLP, and TTS. These AI services are trained with thousands of hours of public and internal datasets to reach high accuracy. You can start with the pretrained models or fine-tune them on your own dataset to further improve model performance.

Riva provides highly optimized services for speech recognition and speech synthesis for use cases like real-time transcription and virtual assistants. Riva uses NVIDIA Triton Inference Server to serve multiple models for efficient and robust resource allocation, and to achieve high throughput, low latency, and high accuracy.

The speech recognition skill is trained and evaluated on a wide variety of real-world, domain-specific datasets. It includes vocabulary from telecommunications, podcasting, and healthcare to deliver world-class accuracy in production use cases.

The Riva text-to-speech (speech synthesis) skill generates human-like speech. It uses non-autoregressive models to deliver 12x higher performance on NVIDIA A100 GPUs compared with Tacotron 2 and WaveGlow models on NVIDIA V100 GPUs.
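To make the gRPC services concrete, here is a minimal sketch of calling the ASR and TTS endpoints with the Riva Python client. It assumes a Riva server is already running and reachable at `localhost:50051`, that the `nvidia-riva-client` package is installed, and that `audio.wav` and the voice name are placeholders you would replace with your own file and an available voice on your server.

```python
# Sketch: transcribe a WAV file (ASR) and synthesize speech (TTS)
# against a locally running Riva server. Server address, file name,
# and voice name are assumptions -- adjust for your deployment.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")

# --- ASR: raw audio in, text out ---
asr_service = riva.client.ASRService(auth)
asr_config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)
with open("audio.wav", "rb") as fh:          # placeholder input file
    audio_bytes = fh.read()
response = asr_service.offline_recognize(audio_bytes, asr_config)
for result in response.results:
    print(result.alternatives[0].transcript)

# --- TTS: text in, raw audio out ---
tts_service = riva.client.SpeechSynthesisService(auth)
tts_response = tts_service.synthesize(
    "Hello from Riva.",
    voice_name="English-US.Female-1",        # placeholder voice name
    language_code="en-US",
    sample_rate_hz=44100,
)
with open("output.raw", "wb") as fh:
    fh.write(tts_response.audio)             # raw PCM samples
```

Both calls go over the same gRPC channel (`auth`), which is how Riva lets one connection drive multiple task-specific services; streaming variants of both APIs exist for the low-latency, real-time use cases described above.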