
Whisper-large instead of whisper small? #131

Open
sleepingcat4 opened this issue Feb 15, 2025 · 17 comments
@sleepingcat4

I read the code and it is very clear. If I change whisper-small to whisper-large, which output dim should I change? @Plachtaa, do you have any hints or directions?

@Plachtaa (Owner) commented Feb 15, 2025

Hi @sleepingcat4, simply changing model_params.length_regulator.in_channels to 1280 in the config file, to match the whisper-large encoder output dim, should work. Don't forget to finetune the model after the change.
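In the preset YAML this amounts to a one-line edit. The key path model_params.length_regulator.in_channels comes from the comment above; the surrounding nesting shown here is a sketch, so check it against your actual config file:

```yaml
model_params:
  length_regulator:
    # 768 matches whisper-small's encoder output; whisper-large uses 1280
    in_channels: 1280
```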

@sleepingcat4 (Author) commented Feb 15, 2025

Thanks @Plachtaa for the quick answer! By the way, if I want to scale the parameters up to 1B, what changes should be made to the DiT architecture? Do you have any advice?

@Plachtaa (Owner)

I don't really suggest doing so, since the merit of a VC model is being real-time and lightweight; the task is not difficult enough to be worth scaling up to 1B.

@sleepingcat4 (Author)

@Plachtaa I wanted to experiment and see how it behaves, since I had some spare compute. I was thinking of increasing the hidden dim of the DiT, but any advice for experimentation only would be nice.

@Plachtaa (Owner)

For your reference:

[image attachment]

@sleepingcat4 (Author)

@Plachtaa thank you for being so helpful. Another question: if I change the vocoder model from "nvidia/bigvgan_v2_22khz_80band_256x" to "https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x", which params should I change in the config?
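The checkpoint names themselves encode the key differences: 44 kHz sample rate vs. 22 kHz, 128 mel bands vs. 80, and 512x vs. 256x upsampling (i.e. hop length). Assuming the config's audio settings mirror those values, the change would look roughly like this — the key names below are assumptions, not the repo's verified schema:

```yaml
preprocess_params:
  sr: 44100          # was 22050 for bigvgan_v2_22khz_80band_256x
  spect_params:
    n_mels: 128      # was 80
    hop_length: 512  # was 256 (the upsampling factor in the checkpoint name)
```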

@Plachtaa (Owner)

@GUUser91

@Plachtaa and @sleepingcat4
I need help. I followed Plachtaa's instructions and edited the config_dit_mel_seed_uvit_whisper_small_wavenet.yml config file from

in_channels: 768

to

in_channels: 1280 

and

name: "openai/whisper-small" 

to

name: "openai/whisper-large". 

I also tried changing it to name: "openai/whisper-large-v2", name: "openai/whisper-large-v3", and name: "openai/whisper-large-v3-turbo".

Then I tried fine-tuning:

python train.py --config ./presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml --dataset-dir Cartoon-Voice --run-name Cartoon-Voice --batch-size 2 --max-steps 600 --max-epochs 1000 --save-every 600 --num-workers 0

No matter which whisper-large model I use, the output file always starts speaking gibberish. Is it because the dataset I gathered totals only 3 minutes and 20 seconds?
https://vocaroo.com/1cTRV93JtsPQ

For reference, source audio
https://vocaroo.com/1jKAso7ZrC4C
reference audio
https://vocaroo.com/1jM7CIP8gROA
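A quick way to check total dataset duration before training — a minimal sketch using only the Python standard library, assuming the dataset is a folder of .wav files:

```python
import wave
from pathlib import Path

def total_duration_seconds(dataset_dir: str) -> float:
    """Sum the duration of every .wav file under dataset_dir (recursively)."""
    total = 0.0
    for path in Path(dataset_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as w:
            total += w.getnframes() / w.getframerate()
    return total
```

For example, `print(f"{total_duration_seconds('Cartoon-Voice') / 60:.1f} minutes")` reports the dataset length from the command above.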

@Plachtaa (Owner)

@GUUser91 You must run pretraining on a large-scale dataset once you swap the encoder from whisper-small to whisper-large-v3.
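The gibberish output is consistent with an encoder mismatch: the length regulator's in_channels has to equal the Whisper encoder's hidden size (d_model), and a checkpoint pretrained with whisper-small features cannot absorb a swapped encoder through a short finetune. For reference, the published d_model values of the openai/whisper-* checkpoints on Hugging Face (all large variants share 1280):

```python
# Encoder hidden sizes (d_model) of the openai/whisper-* checkpoints on Hugging Face.
WHISPER_ENCODER_DIM = {
    "openai/whisper-tiny": 384,
    "openai/whisper-base": 512,
    "openai/whisper-small": 768,
    "openai/whisper-medium": 1024,
    "openai/whisper-large": 1280,
    "openai/whisper-large-v2": 1280,
    "openai/whisper-large-v3": 1280,
    "openai/whisper-large-v3-turbo": 1280,
}

def required_in_channels(model_name: str) -> int:
    """in_channels for model_params.length_regulator must match the encoder dim."""
    return WHISPER_ENCODER_DIM[model_name]
```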

@sleepingcat4 (Author)

@GUUser91 Definitely, you should gather more data. I trained on 2 hours and got awesome results!

@GUUser91 commented Feb 22, 2025

@sleepingcat4 Did you fine-tune a model, or did you train from scratch?

@sleepingcat4 (Author)

@GUUser91 I just fine-tuned.

@GUUser91

@sleepingcat4 How many steps did you fine-tune the model for?

@sleepingcat4 (Author)

@GUUser91 400 steps.

@Gonzaluigi

I'm doing the same, but it doesn't detect any GPU to train on. Training is too slow on CPU, even though I have a GPU (an RTX 4060).
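A quick sanity check before launching train.py — a sketch that reports whether the installed PyTorch build can actually see the GPU (a CPU-only torch wheel is the usual culprit):

```python
import importlib.util

def cuda_available() -> bool:
    """True only if torch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # torch not installed at all
    import torch
    return torch.cuda.is_available()

if cuda_available():
    import torch
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; training will fall back to CPU")
```

If this reports no GPU on an RTX 4060 machine, reinstalling torch with a CUDA-enabled wheel (via the official PyTorch install selector) is the usual fix.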

@leminhnguyen commented Mar 6, 2025

Hi @sleepingcat4, did you change whisper-small to whisper-large-v3 and finetune with only 4 hours of data? One more question: is that 4 hours a single-speaker or multi-speaker dataset? Many thanks!

@sleepingcat4 (Author)

@leminhnguyen Yes, I trained on a multi-speaker dataset with almost 4 hours of data. My data was very high quality; it's an 11labs dataset that I had developed and open-sourced a few days earlier on HF under my lab, Sleeping AI.
