created on 2025-11-04

#artificial-intelligence #machine-learning

Fine-tuning SpeechT5 to Imitate Me

Machine learning has changed the world in the last couple of years. I was a consumer of these models but only a spectator when it came to understanding how they worked. In this post, I’ll share my experience trying to train the SpeechT5 text-to-speech model to imitate my own voice.

As I often do nowadays, I started by asking ChatGPT, Grok, and Claude questions; they are great tools for learning. I also familiarized myself with Hugging Face. If you haven't used it before, it's like GitHub, but for machine learning models.

I tried a few different models at first, namely Coqui, SpeechT5, and Dia. I chose SpeechT5 because its instructions were clearer and its dependencies fewer than the others'. But you should still check out Dia's demo; it is incredible.

To make this work, I had to record myself first. I asked Claude to generate a table of filenames and transcriptions. I recorded my voice with Audacity, then wrote a Python script to process the recordings. It’s best to trim silence, make the audio mono, normalize it, and resample it to 16 kHz. Since I’m comfortable with signal processing, this part was easy, and Python’s audio libraries made it even smoother. It’s easy to see why Python is such a popular language.
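
Here is a minimal sketch of that preprocessing, assuming the librosa and soundfile packages are installed; the filenames and the trim threshold are illustrative, not my exact script.

import librosa
import soundfile as sf

def preprocess(in_path, out_path, target_sr=16000):
    # Load as mono and resample to 16 kHz in one step.
    audio, sr = librosa.load(in_path, sr=target_sr, mono=True)
    # Trim leading/trailing silence (anything ~30 dB below the peak).
    audio, _ = librosa.effects.trim(audio, top_db=30)
    # Peak-normalize so the loudest sample sits just below full scale.
    peak = max(abs(audio.max()), abs(audio.min()))
    if peak > 0:
        audio = audio / peak * 0.95
    sf.write(out_path, audio, target_sr)

preprocess("recordings/clip_001.wav", "processed/clip_001.wav")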

The next step was creating the speaker embedding. A speaker embedding is basically an array of floating-point numbers representing someone's voice characteristics (in this case, mine). You can see the values below, but they do not mean much to a human.

tensor([-0.0694, -0.0536, -0.0454, -0.0202,  0.0246,  0.0315,  0.0061,  0.0169,
         0.0244,  0.0273,  0.0311,  0.0291,  0.0320,  0.0250,  0.0165,  0.0154,
         0.0166,  0.0136, -0.0011, -0.0153, -0.0247, -0.0348, -0.0390, -0.0376,
        -0.0399, -0.0333, -0.0318, -0.0360, -0.0291, -0.0306, -0.0392, -0.0420,
        -0.0400, -0.0354, -0.0329, -0.0323, -0.0282, -0.0239, -0.0239, -0.0245,
        -0.0225, -0.0221, -0.0266, -0.0380, -0.0494, -0.0558, -0.0513, -0.0394,
        -0.0308, -0.0254, -0.0268, -0.0333, -0.0396, -0.0429, -0.0407, -0.0396,
        -0.0401, -0.0401, -0.0412, -0.0482, -0.0535, -0.0506, -0.0489, -0.0508,
        -0.0531, -0.0564, -0.0573, -0.0607, -0.0647, -0.0624, -0.0613, -0.0618,
        -0.0628, -0.0617, -0.0549, -0.0487, -0.0499, -0.0517, -0.0617, -0.0946,
         0.1067,  0.1038,  0.1044,  0.1078,  0.1150,  0.1166,  0.1083,  0.1124,
         0.1162,  0.1186,  0.1183,  0.1218,  0.1186,  0.1160,  0.1140,  0.1137,
         0.1157,  0.1140,  0.1123,  0.1069,  0.1025,  0.0999,  0.0984,  0.0984,
         0.0990,  0.1003,  0.0995,  0.0998,  0.0996,  0.0981,  0.0979,  0.0962,
         0.0976,  0.0970,  0.0952,  0.0935,  0.0957,  0.0936,  0.0955,  0.0965,
         0.0974,  0.0981,  0.0977,  0.0957,  0.0902,  0.0880,  0.0879,  0.0916,
         0.0950,  0.0959,  0.0945,  0.0934,  0.0933,  0.0910,  0.0908,  0.0932,
         0.0942,  0.0939,  0.0931,  0.0889,  0.0864,  0.0866,  0.0881,  0.0876,
         0.0878,  0.0879,  0.0879,  0.0879,  0.0863,  0.0859,  0.0858,  0.0853,
         0.0859,  0.0850,  0.0860,  0.0885,  0.0876,  0.0872,  0.0857,  0.0783,
         0.0170,  0.0170,  0.0171,  0.0172,  0.0172,  0.0173,  0.0174,  0.0174,
         0.0175,  0.0176,  0.0176,  0.0177,  0.0177,  0.0178,  0.0179,  0.0179,
         0.0180,  0.0180,  0.0181,  0.0182,  0.0182,  0.0183,  0.0183,  0.0184,
         0.0184,  0.0185,  0.0185,  0.0186,  0.0186,  0.0187,  0.0187,  0.0188,
         0.0188,  0.0189,  0.0189,  0.0190,  0.0190,  0.0191,  0.0191,  0.0192,
         0.0192,  0.0192,  0.0193,  0.0193,  0.0194,  0.0194,  0.0194,  0.0195,
         0.0195,  0.0195,  0.0196,  0.0196,  0.0196,  0.0197,  0.0197,  0.0197,
         0.0198,  0.0198,  0.0198,  0.0199,  0.0199,  0.0199,  0.0199,  0.0200,
         0.0200,  0.0200,  0.0200,  0.0201,  0.0201,  0.0201,  0.0201,  0.0201,
         0.0202,  0.0202,  0.0202,  0.0202,  0.0202,  0.0202,  0.0203,  0.0203,
         0.0203,  0.0203,  0.0203,  0.0203,  0.0203,  0.0203,  0.0203,  0.0204,
         0.0204,  0.0204,  0.0204,  0.0204,  0.0204,  0.0204,  0.0204,  0.0204,
         0.0204,  0.0204,  0.0204,  0.0204,  0.0204,  0.0204,  0.0204,  0.0204,
         0.0204,  0.0204,  0.0203,  0.0203,  0.0203,  0.0203,  0.0203,  0.0203,
         0.0203,  0.0203,  0.0203,  0.0202,  0.0202,  0.0202,  0.0202,  0.0202,
         0.0202,  0.0201,  0.0201,  0.0201,  0.0201,  0.0201,  0.0200,  0.0200,
         0.0200,  0.0200,  0.0199,  0.0199,  0.0199,  0.0199,  0.0198,  0.0198,
         0.0198,  0.0197,  0.0197,  0.0197,  0.0196,  0.0196,  0.0196,  0.0195,
         0.0195,  0.0195,  0.0194,  0.0194,  0.0194,  0.0193,  0.0193,  0.0192,
         0.0192,  0.0192,  0.0191,  0.0191,  0.0190,  0.0190,  0.0189,  0.0189,
         0.0188,  0.0188,  0.0187,  0.0187,  0.0186,  0.0186,  0.0185,  0.0185,
         0.0184,  0.0184,  0.0183,  0.0183,  0.0182,  0.0182,  0.0181,  0.0180,
         0.0180,  0.0179,  0.0179,  0.0178,  0.0177,  0.0177,  0.0176,  0.0176,
         0.0175,  0.0174,  0.0174,  0.0173,  0.0172,  0.0172,  0.0171,  0.0170,
         0.0170,  0.0169,  0.0168,  0.0167,  0.0167,  0.0166,  0.0165,  0.0164,
         0.0164,  0.0163,  0.0162,  0.0161,  0.0161,  0.0160,  0.0159,  0.0158,
         0.0158,  0.0157,  0.0156,  0.0155,  0.0154,  0.0154,  0.0153,  0.0152,
         0.0151,  0.0150,  0.0149,  0.0149,  0.0148,  0.0147,  0.0146,  0.0145,
         0.0144,  0.0143,  0.0142,  0.0141,  0.0141,  0.0140,  0.0139,  0.0138,
         0.0137,  0.0136,  0.0135,  0.0134,  0.0133,  0.0132,  0.0131,  0.0130,
         0.0129,  0.0128,  0.0127,  0.0126,  0.0125,  0.0124,  0.0123,  0.0122,
         0.0121,  0.0120,  0.0119,  0.0118,  0.0117,  0.0116,  0.0115,  0.0114,
         0.0113,  0.0112,  0.0111,  0.0110,  0.0109,  0.0108,  0.0107,  0.0106,
         0.0105,  0.0104,  0.0103,  0.0102,  0.0100,  0.0099,  0.0098,  0.0097,
         0.0096,  0.0095,  0.0094,  0.0093,  0.0092,  0.0091,  0.0089,  0.0088,
         0.0087,  0.0086,  0.0085,  0.0084,  0.0083,  0.0081,  0.0080,  0.0079,
         0.0078,  0.0077,  0.0076,  0.0075,  0.0073,  0.0072,  0.0071,  0.0070,
         0.0069,  0.0068,  0.0066,  0.0065,  0.0064,  0.0063,  0.0062,  0.0060,
         0.0059,  0.0058,  0.0057,  0.0056,  0.0054,  0.0053,  0.0052,  0.0051,
         0.0050,  0.0048,  0.0047,  0.0046,  0.0045,  0.0043,  0.0042,  0.0041,
         0.0040,  0.0039,  0.0037,  0.0036,  0.0035,  0.0034,  0.0032,  0.0031,
         0.0030,  0.0029,  0.0027,  0.0026,  0.0025,  0.0024,  0.0022,  0.0021,
         0.0020,  0.0019,  0.0017,  0.0016,  0.0015,  0.0014,  0.0013,  0.0011,
         0.0010,  0.0009,  0.0008,  0.0006,  0.0005,  0.0004,  0.0003,  0.0001],
       device='cuda:0')
Speaker embedding tensor
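
For reference, here is roughly how such an embedding can be computed with SpeechBrain's pretrained x-vector encoder, which is the speaker encoder commonly paired with SpeechT5. The file path is illustrative; printing the result gives a dump like the one above.

import torch
import soundfile as sf
from speechbrain.pretrained import EncoderClassifier

# A 512-dimensional x-vector encoder, commonly used with SpeechT5.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb",
)

# Expects 16 kHz mono audio, which the preprocessing step produces.
audio, sr = sf.read("processed/clip_001.wav")
waveform = torch.tensor(audio, dtype=torch.float32).unsqueeze(0)

with torch.no_grad():
    embedding = encoder.encode_batch(waveform)            # shape: (1, 1, 512)
    embedding = torch.nn.functional.normalize(embedding, dim=2)
    speaker_embedding = embedding.squeeze(0)              # shape: (1, 512)

print(speaker_embedding)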

Finally came training the model. My first attempt was surprisingly quick since I ran it on the GPU, and after hours of preparation, I was eager to hear the result. Unfortunately, the model had collapsed and produced only horrible noise. I tried again with a much lower learning rate, watching the loss value gradually decrease and feeling hopeful. But the model collapsed again, generating the same noise. I later learned that it’s not possible to train a model like SpeechT5 properly with only ten minutes of audio.
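
For context, fine-tuning SpeechT5 through the Hugging Face Trainer API looks roughly like the sketch below. The hyperparameters are illustrative rather than my exact values, and train_dataset and data_collator are hypothetical stand-ins for a preprocessed dataset and a collator that pads the spectrograms and attaches the speaker embeddings.

from transformers import (
    SpeechT5ForTextToSpeech,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned",  # illustrative path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,               # too high a rate is one way a run collapses into noise
    warmup_steps=100,
    max_steps=1500,
    fp16=True,                        # assumes a CUDA GPU
    logging_steps=25,
    label_names=["labels"],
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # hypothetical: preprocessed (text, spectrogram) pairs
    data_collator=data_collator,      # hypothetical: pads inputs, adds speaker embeddings
)
trainer.train()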

I said that the numbers in the speaker embedding are meaningless to humans, but that is not entirely true. If you look closely at the numbers, the first 160 vary in sign and amplitude, but after that they hover around 0.018. This might mean that the audio behind the embedding was limited (exactly my case), or that the embedding is only partially trained.

After some research, I found out that instead of training the full model, I could fine-tune it using my speaker embedding. Since SpeechT5 already knows how to speak English, I could nudge it toward my vocal characteristics. This approach worked better. The generated audio was still noisy and robotic compared to the model's original voices, but it definitely resembled my voice and captured my speech rhythm.
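
The generation step itself is short. Below is a minimal sketch using the stock Hugging Face checkpoints (a fine-tuned checkpoint would be loaded in place of microsoft/speecht5_tts) and the speaker_embedding computed earlier.

import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, this is my cloned voice.", return_tensors="pt")

# speaker_embedding is the (1, 512) tensor computed earlier.
speech = model.generate_speech(
    inputs["input_ids"], speaker_embedding, vocoder=vocoder
)

# SpeechT5 generates audio at 16 kHz.
sf.write("output.wav", speech.numpy(), samplerate=16000)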

There is still a lot to experiment with. First, I want this model to properly clone my voice. Then I want to play with more expressive models like Dia.

If you think this article is wrong or missing something, or if you have a question, please feel free to send me a message.
