
Whisper-large instead of whisper small? #131

Open
sleepingcat4 opened this issue Feb 15, 2025 · 17 comments
@sleepingcat4

I read the code and it is very clear. If I change whisper-small to whisper-large, which output dim should I change? @Plachtaa, do you have any hints or directions?

@Plachtaa (Owner) commented Feb 15, 2025

Hi @sleepingcat4, simply changing model_params.length_regulator.in_channels to 1280 in the config file, to match the whisper-large encoder output dim, should work. Don't forget to finetune the model after the change.
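In the preset YAML this amounts to a one-line edit. The key path model_params.length_regulator.in_channels comes from the comment above; the surrounding nesting shown here is a sketch, so check it against your actual config file:

```yaml
model_params:
  length_regulator:
    # 768 matches whisper-small's encoder output; whisper-large uses 1280
    in_channels: 1280
```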

@sleepingcat4 (Author) commented Feb 15, 2025

Thanks @Plachtaa for the quick answer! By the way, if I want to scale the parameters up to 1B, what changes should be made to the DiT architecture? Do you have any advice?

@Plachtaa (Owner)

I don't really suggest doing so, since the merit of a VC model is being real-time and lightweight; the task is not difficult enough to be worth scaling up to 1B.

@sleepingcat4 (Author)

@Plachtaa I wanted to experiment and see how it behaves, since I had some spare compute. I was thinking of increasing the hidden dim of the DiT, but any advice for experimentation only would be nice.

@Plachtaa (Owner)

For your reference:

[image attachment]

@sleepingcat4 (Author)

@Plachtaa thank you for being so helpful. Another question: if I change the vocoder model from "nvidia/bigvgan_v2_22khz_80band_256x" to "https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x", which params should I change in the config?
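The checkpoint names themselves encode the key differences: 44 kHz sample rate vs. 22 kHz, 128 mel bands vs. 80, and 512x vs. 256x upsampling (i.e. hop length). Assuming the config's audio settings mirror those values, the change would look roughly like this — the key names below are assumptions, not the repo's verified schema:

```yaml
preprocess_params:
  sr: 44100          # was 22050 for bigvgan_v2_22khz_80band_256x
  spect_params:
    n_mels: 128      # was 80
    hop_length: 512  # was 256 (the upsampling factor in the checkpoint name)
```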

@Plachtaa (Owner)

@GUUser91

@Plachtaa and @sleepingcat4
I need help. I followed Plachtaa's instructions and edited the config_dit_mel_seed_uvit_whisper_small_wavenet.yml config file from

in_channels: 768

to

in_channels: 1280 

and

name: "openai/whisper-small" 

to

name: "openai/whisper-large". 

I also tried changing it to name: "openai/whisper-large-v2", name: "openai/whisper-large-v3", and name: "openai/whisper-large-v3-turbo".

Then I tried fine-tuning:

python train.py --config ./presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml --dataset-dir Cartoon-Voice --run-name Cartoon-Voice --batch-size 2 --max-steps 600 --max-epochs 1000 --save-every 600 --num-workers 0

No matter which whisper-large model I use, the output file always starts speaking gibberish. Is it because the dataset I gathered totals only 3 minutes and 20 seconds?
https://vocaroo.com/1cTRV93JtsPQ

For reference, source audio
https://vocaroo.com/1jKAso7ZrC4C
reference audio
https://vocaroo.com/1jM7CIP8gROA
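A quick way to check total dataset duration before training — a minimal sketch using only the Python standard library, assuming the dataset is a folder of .wav files:

```python
import wave
from pathlib import Path

def total_duration_seconds(dataset_dir: str) -> float:
    """Sum the duration of every .wav file under dataset_dir (recursively)."""
    total = 0.0
    for path in Path(dataset_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as w:
            total += w.getnframes() / w.getframerate()
    return total
```

For example, `print(f"{total_duration_seconds('Cartoon-Voice') / 60:.1f} minutes")` reports the dataset length from the command above.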

@Plachtaa (Owner)

@GUUser91 You must run pretraining on a large-scale dataset once you swap the encoder from whisper-small to whisper-large-v3.
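The gibberish output is consistent with an encoder mismatch: the length regulator's in_channels has to equal the Whisper encoder's hidden size (d_model), and a checkpoint pretrained with whisper-small features cannot absorb a swapped encoder through a short finetune. For reference, the published d_model values of the openai/whisper-* checkpoints on Hugging Face (all large variants share 1280):

```python
# Encoder hidden sizes (d_model) of the openai/whisper-* checkpoints on Hugging Face.
WHISPER_ENCODER_DIM = {
    "openai/whisper-tiny": 384,
    "openai/whisper-base": 512,
    "openai/whisper-small": 768,
    "openai/whisper-medium": 1024,
    "openai/whisper-large": 1280,
    "openai/whisper-large-v2": 1280,
    "openai/whisper-large-v3": 1280,
    "openai/whisper-large-v3-turbo": 1280,
}

def required_in_channels(model_name: str) -> int:
    """in_channels for model_params.length_regulator must match the encoder dim."""
    return WHISPER_ENCODER_DIM[model_name]
```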

@sleepingcat4 (Author)

@GUUser91 Definitely, you should gather more data. I trained on 2 hours and got awesome results!

@GUUser91 commented Feb 22, 2025

@sleepingcat4 Did you fine-tune a model, or did you train from scratch?

@sleepingcat4 (Author)

@GUUser91 I just fine-tuned.

@GUUser91

@sleepingcat4 How many steps did you fine-tune the model for?

@sleepingcat4 (Author)

@GUUser91 400 steps.

@Gonzaluigi

I'm doing the same, but it doesn't detect any GPU to train on. Training is too slow on CPU, even though I have a GPU (an RTX 4060).
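A quick sanity check before launching train.py — a sketch that reports whether the installed PyTorch build can actually see the GPU (a CPU-only torch wheel is the usual culprit):

```python
import importlib.util

def cuda_available() -> bool:
    """True only if torch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # torch not installed at all
    import torch
    return torch.cuda.is_available()

if cuda_available():
    import torch
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; training will fall back to CPU")
```

If this reports no GPU on an RTX 4060 machine, reinstalling torch with a CUDA-enabled wheel (via the official PyTorch install selector) is the usual fix.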

@leminhnguyen commented Mar 6, 2025

Hi @sleepingcat4, did you change whisper-small to whisper-large-v3 and finetune with only 4 hours of data? One more question: is that 4 hours a single-speaker or multi-speaker dataset? Many thanks!

@sleepingcat4 (Author)

@leminhnguyen Yes, I trained on a multi-speaker dataset with almost 4 hours of data. My data was very high quality; it's an 11labs dataset that I had developed and open-sourced a few days earlier on HF under my lab, Sleeping AI.
