I'm the author of and can speak to Tortoise and the TTS field.

Tortoise produces quality results with limited training data, but it is an extremely slow model that is not suitable for real-time use cases. It's good for creatives making one-off deepfake YouTube videos, and that's about it.

You're looking for Tacotron 2 or one of its offshoots that add multi-speaker support, TorchMoji, etc. You'll want to pair it with the HiFi-GAN vocoder to get end-to-end text to speech. Your pipeline looks like this at a high level:

Input text => Text pre-processing => Synthesizer => Vocoder => Output audio

TalkNet is also popular when a secondary reference pitch signal is supplied. You can mimic singing and emotion pretty easily.

These three models are faster than real time, and there's a lot of information available and a big community built up around them.
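To make the pipeline stages concrete, here is a minimal sketch of how they chain together. It is not Tacotron 2, HiFi-GAN, or TalkNet code: every function, constant, and shape here (`preprocess_text`, `synthesize`, `vocode`, `N_MELS`, `HOP_LENGTH`, `FRAMES_PER_CHAR`) is a hypothetical stand-in that returns silence, and it only illustrates what each stage consumes and produces.

```python
import re
from typing import List

# All constants are assumptions for illustration, not values from any model.
N_MELS = 80          # mel channels per spectrogram frame
HOP_LENGTH = 256     # audio samples generated per mel frame
FRAMES_PER_CHAR = 5  # made-up duration rule; real models predict this


def preprocess_text(text: str) -> str:
    """Text pre-processing: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower().strip())


def synthesize(text: str) -> List[List[float]]:
    """Stand-in for the synthesizer stage: text -> mel-spectrogram frames."""
    n_frames = len(text) * FRAMES_PER_CHAR
    return [[0.0] * N_MELS for _ in range(n_frames)]


def vocode(mel: List[List[float]]) -> List[float]:
    """Stand-in for the vocoder stage: mel frames -> waveform samples."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def text_to_speech(text: str) -> List[float]:
    """Input text => pre-processing => synthesizer => vocoder => audio."""
    return vocode(synthesize(preprocess_text(text)))


audio = text_to_speech("  Hello   World ")
print(len(audio), "samples of placeholder audio")
```

In a real setup you would swap the two stand-ins for trained models; the point is that the synthesizer and vocoder only meet at the mel spectrogram, which is why they can be mixed and matched.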