
Personalized Voice Cloning Using GANs


TEAM MEMBERS:

Sai Akhil K - 1MS16EC087

Vignesh A - 1MS16EC130

A V Phani Koushik - 1MS12EC024

Introduction

GUIDE:

Dr. S. SETHU SELVI

HOD

Dept. of E&C

RIT, Bangalore

Objective

PROBLEM STATEMENT:

Recreating a person's speech with appropriate emotion and pronunciation is critically important, and its applications are numerous. Most existing methods require many hours of recordings and significant computational power, and are ultimately inefficient.

OBJECTIVE:

Restore the ability to communicate naturally to users who have lost their voice, by building a stylized TTS system that can recreate a person's speech identity from only a few input samples.

Applications of Personalized Voice:

  • Restoring voice digitally to people who use speaking aids.
  • Use in mobile voice assistants, audio books, and the music industry.
  • Use for dubbing in movies.
  • Providing international content with language-based text-to-speech applications.
  • Recreating the voices of celebrities of yesteryear, such as long-dead politicians and actors.

Resources

Data sets used:

  • VoxCeleb - 100 hours of celebrity voices.
  • LibriSpeech - 100 hours of clean audio with transcription.
  • TESS (Toronto Emotional Speech Set) - 2 speakers, 200 samples each.
  • VCTK - 109 native English speakers, each reading about 400 sentences.

Papers Used:

  • [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

  • [2] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez Moreno, and Y. Wu, "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis," 2019.

  • [3] Y. Gao, R. Singh, and B. Raj, "Voice Impersonation Using Generative Adversarial Networks," 2018.


Literature Survey

Generative Adversarial Nets [1]

  • A new framework for estimating generative models via an adversarial process in which two models are trained simultaneously.
  • A generator model G captures the data distribution, and a discriminator model D estimates the probability that a sample came from the training data rather than from G.
  • The training procedure for G is to maximize the probability of D making a mistake.
  • G is trained until D outputs 1/2 for all samples, irrespective of their origin.
  • When G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation.
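For reference, the two-player minimax objective from [1] that this training procedure optimizes is:

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
\]

Here D(x) is the discriminator's estimate that x came from the training data, and G(z) is a sample generated from noise z.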

Voice Impersonation Using Generative Adversarial Networks [3]

  • In order to apply a GAN to speech, the speech must first be converted into an invertible, image-like representation, namely a spectrogram.
  • GANs were originally designed to operate on images of fixed size. For them to work with inherently variable-length speech signals, this constraint must be relaxed in the new design.
  • It is important to ensure that the linguistic information in the speech signal is not lost, even though the signal itself is modified.
  • The following factors were addressed to make a traditional GAN suitable for speech (a spectrogram-conversion sketch follows the list below):

  • Retaining Linguistic Information

  • Variable-length Input Generator and Discriminator

  • Style Embedding Model (DS)

  • Total Loss
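As a minimal illustration of the first point above, speech can be converted to a log-mel spectrogram with librosa; the exact parameters below (sample rate, FFT size, hop length, number of mel bands) are illustrative assumptions, not the settings used in [3].

```python
import librosa

# Load a mono waveform; 16 kHz matches the sampling rate used later in training.
y, sr = librosa.load("sample.wav", sr=16000)

# Mel spectrogram: an image-like, approximately invertible representation of speech.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)  # 2D array of shape (80 mel bands, n_frames)
```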

Flow Chart

Speech Synthesis Using LSTM

Three main steps of implementation:

Encoder:

"Encodes" the speaker's voice into vector embeddings.

Synthesizer:

Uses the embedding vectors from the Encoder to generate a new spectrogram for new input text.

Vocoder:

Converts the generated spectrogram into the final TTS audio output.

Flow of Generating output

1. The target speaker's voice is given as input to the Speaker Encoder.

2. The text to be converted to speech is given as input to the Encoder of the Synthesizer.

3. The Vocoder outputs the input sentence spoken in the target speaker's voice.
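A minimal sketch of this flow in Python, assuming hypothetical encoder, synthesizer, and vocoder objects; the names and methods below are placeholders for illustration, not the actual API of the toolkit we used.

```python
import numpy as np

def clone_and_speak(reference_wav: np.ndarray, text: str,
                    encoder, synthesizer, vocoder) -> np.ndarray:
    """End-to-end flow: reference voice + text -> speech in the target voice."""
    # 1. Target speaker's voice -> fixed-size speaker embedding.
    speaker_embedding = encoder.embed(reference_wav)            # e.g. a 256-dim vector
    # 2. Text + speaker embedding -> mel spectrogram in the target voice.
    mel_spectrogram = synthesizer.synthesize(text, speaker_embedding)
    # 3. Mel spectrogram -> audio waveform.
    return vocoder.generate(mel_spectrogram)
```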

Training of Encoder, Synthesizer, and Vocoder

Description of Encoder:

Architecture:

  • The model is a 3-layer LSTM with 768 hidden nodes, followed by a projection layer of 256 units.
  • The inputs to the model are 40-channel log-mel spectrograms with a 25 ms window width and a 10 ms step.
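A minimal PyTorch sketch of this architecture, assuming the 40-channel log-mel frames are already extracted; the layer sizes follow the description above, while everything else (such as L2-normalizing the output embedding) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """3-layer LSTM (768 hidden units) followed by a 256-unit projection."""
    def __init__(self, n_mels: int = 40, hidden: int = 768, embed_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                            num_layers=3, batch_first=True)
        self.projection = nn.Linear(hidden, embed_dim)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, n_frames, 40) log-mel frames
        _, (hidden, _) = self.lstm(mels)            # final hidden state of each layer
        embedding = self.projection(hidden[-1])     # use the last layer's state
        # Assumption: L2-normalize so embeddings can be compared by cosine similarity.
        return embedding / embedding.norm(dim=1, keepdim=True)
```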

Functionality:

  • The speaker encoder is trained on a speaker verification task.
  • This task yields feature embeddings that capture which characteristics of the speech are important and unique to each speaker.

Training:

  • The encoder is trained on the LibriSpeech, VoxCeleb1, VoxCeleb2, and VCTK datasets.

Synthesizer Block Representation:

Description of Synthesizer:

Architecture:

  • It consists of two parts: an encoder and a decoder.
  • The synthesizer we use is based on Tacotron 2.

Flow:

  • The input text is embedded into vectors before entering the neural network.
  • A bidirectional LSTM produces the encoder output.
  • The most important step happens after the encoder produces its output:
  • The speaker embedding of the target speaker is combined with the encoder output (see the sketch after this list).
  • The result is then fed to the decoder, which produces a mel spectrogram.
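A minimal sketch of this conditioning step, assuming (as in [2]) that the speaker embedding is concatenated with every frame of the encoder output; the tensor shapes are illustrative.

```python
import torch

def condition_on_speaker(encoder_out: torch.Tensor,
                         speaker_embedding: torch.Tensor) -> torch.Tensor:
    """Attach the speaker identity to each encoder time step.

    encoder_out:       (batch, n_text_frames, enc_dim)
    speaker_embedding: (batch, embed_dim)
    returns:           (batch, n_text_frames, enc_dim + embed_dim)
    """
    n_frames = encoder_out.size(1)
    expanded = speaker_embedding.unsqueeze(1).expand(-1, n_frames, -1)
    # The decoder then attends over this combined representation to emit mel frames.
    return torch.cat([encoder_out, expanded], dim=-1)
```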

Training:

  • Synthesizer is trained on LibriSpeech and VCTK datasets

Vocoder:

  • This is the final part and is responsible for producing the audio output.
  • It is implemented using WaveRNN.
  • The documentation for this part is unclear, so we relied on the source code to implement it.
  • The vocoder converts a log-mel spectrogram into an audio sample.
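The WaveRNN details live in the source code; purely as an illustration of the spectrogram-to-waveform step, the same conversion can be approximated with librosa's Griffin-Lim based mel inversion (a stand-in for illustration only, not the WaveRNN vocoder used in the project).

```python
import numpy as np
import librosa

def mel_to_waveform(log_mel: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Approximate inversion of a log-mel spectrogram (n_mels, n_frames) to audio."""
    mel_power = librosa.db_to_power(log_mel)
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr,
                                                n_fft=1024, hop_length=160)
```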

Training:

  • Training for Vocoder is done using LibriSpeech dataset.

Drawbacks:

  • Needs rigorous training.
  • The model does not understand punctuation.
  • It also does not understand how to process emotions.

Introduction

Emotion Transfer

  • This section of the project deals with emotion transfer on synthesized speech.
  • The main objective of this half of the project is to recreate the nuances present in speech, as well as emotions such as anger and sadness.

  • This section is divided into 3 parts:
  • Model architecture
  • Pre-processing steps
  • Training

Generator Architecture


Discriminator Architecture


Pre-processing


  • The main challenge in pre-processing is converting 1D audio samples into 2D images.

  • To pre-process, the following steps are carried out (a sketch follows this list):
  • Split the audio sample into frames.
  • Decompose each frame into f0 and the spectral envelope.
  • Encode the envelope to generate MFCCs.
  • Stack the MFCCs on top of each other to form a 2D matrix.
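A minimal sketch of these steps using the WORLD vocoder's Python bindings (pyworld); the library choice and the 24-coefficient envelope encoding are assumptions for illustration.

```python
import numpy as np
import pyworld as pw  # WORLD analysis/synthesis bindings (assumed available)

def audio_to_feature_image(x: np.ndarray, fs: int = 16000) -> np.ndarray:
    """1D waveform -> 2D feature matrix, following the steps above."""
    x = x.astype(np.float64)
    # Frame the signal and estimate f0 per frame (5 ms frames, as in training).
    f0, t = pw.harvest(x, fs, frame_period=5.0)
    # Spectral envelope for each frame.
    sp = pw.cheaptrick(x, f0, t, fs)
    # Encode the envelope into a small set of cepstral (MFCC-like) coefficients.
    coded = pw.code_spectral_envelope(sp, fs, 24)   # (n_frames, 24); 24 is an assumption
    # Stack frames to form the 2D matrix the GAN treats as an image.
    return coded.T                                   # (24, n_frames)
```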

Training Details

  • Training was conducted for 140 epochs on a GTX 980M.

  • Audio samples were sampled at 16 kHz.

  • Frames were extracted with a sliding window of 5 milliseconds.

Training

Gen Loss after 140 epochs

Results

Avg. Dis Loss for every 10 epochs

Neural Voice Cloning

Training of Multi-Speaker Generative Model

Voice Cloning of Target speaker

Audio generation

Mathematical representation of speaker adaptation

Drawbacks

  • The generative model is trained on Western speakers. When an Indian target speaker is given, a huge amount of data is required to influence the generative model, which defeats the purpose of cloning from a few voice samples.
  • It produces only neutral emotion, which does not convince the human ear regardless of the mathematical accuracy achieved.

Conclusion

  • The desired accuracy wasn't achieved, as training was carried out on mediocre machines.
  • We were not able to implement the model for Indian voices due to lack of resources.
  • Decent accuracy was achieved in all parts for Western data, while for Indian data it was only possible using the encoder and synthesizer model.

Pretrained Outputs can be found here:

https://my-voice-8a71f.web.app/

Future Work

  • Increase the training set size to get the desired outputs.
  • Implement the model for Indian English speakers.
  • Try to implement the same model on different Indian languages.

  • Once the above is achieved, we would like to commercialise the solution for markets including voice messages, the movie industry, audio books, etc.