Whisper-large instead of whisper small? #131
Comments
Hi @sleepingcat4, simply change
Thanks @Plachtaa for the quick answer! By the way, if I want to increase the parameter count up to 1B, what changes should be made to the DiT architecture? Do you have any advice?
I wouldn't really suggest doing so. The merit of a VC model is that it is real-time and lightweight, and VC is not a task so difficult that it is worth scaling up to 1B.
@Plachtaa I wanted to experiment and see how it might behave, since I had some spare compute. I was thinking of increasing the hidden dim of the DiT, but if you could suggest some advice, for experimentation only, that would be nice.
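For sizing such an experiment, a rough back-of-the-envelope estimate can help. The sketch below assumes plain transformer blocks (four attention projections plus an MLP with a 4x expansion) and ignores embeddings, adaLN modulation layers, and any other repo-specific modules, so the real DiT in this project will come out somewhat larger. The function names are made up for illustration and are not part of the repository.

```python
import math


def approx_transformer_params(depth: int, hidden_dim: int, mlp_ratio: float = 4.0) -> int:
    """Rough parameter count for a stack of standard transformer blocks.

    Assumption: each block has Q/K/V/output projections (4 * d^2) and a
    two-layer MLP (2 * mlp_ratio * d^2). Biases, norms, embeddings and
    conditioning layers are ignored.
    """
    attn = 4 * hidden_dim ** 2
    mlp = 2 * mlp_ratio * hidden_dim ** 2
    return int(depth * (attn + mlp))


def hidden_dim_for_target(target_params: float, depth: int,
                          mlp_ratio: float = 4.0, multiple_of: int = 64) -> int:
    """Hidden dim (rounded to a multiple of `multiple_of`) that roughly hits the target."""
    per_d2 = depth * (4 + 2 * mlp_ratio)
    raw = math.sqrt(target_params / per_d2)
    return max(multiple_of, round(raw / multiple_of) * multiple_of)


if __name__ == "__main__":
    # Compare a few depth choices for a ~1B-parameter budget.
    for depth in (16, 24, 32):
        d = hidden_dim_for_target(1e9, depth)
        print(f"depth={depth:2d} hidden_dim={d} approx_params={approx_transformer_params(depth, d):,}")
```

Under these assumptions, widening alone scales the count quadratically in the hidden dim, so trading some width for depth is usually worth comparing before committing compute.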
@Plachtaa thank you for being so helpful. Another question: if I change this voice encoder model from "nvidia/bigvgan_v2_22khz_80band_256x" to "https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x", which params should I change in the config?
see
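As a hint for which values are involved: the BigVGAN checkpoint names appear to encode the sample rate, the number of mel bands, and the upsampling factor (which corresponds to the mel hop length). The sketch below only parses the name; the helper and the output key names are hypothetical, the mapping from the nominal "khz" tag to an exact rate is an assumption, and the result should be checked against the checkpoint's own `config.json` before editing this project's config.

```python
import re

# Assumption: nominal kHz tags map to the conventional audio sample rates.
NOMINAL_SAMPLE_RATES = {22: 22050, 24: 24000, 44: 44100}


def parse_bigvgan_name(name: str) -> dict:
    """Extract mel settings implied by a BigVGAN checkpoint name or URL."""
    match = re.search(r"(\d+)khz_(\d+)band_(\d+)x", name)
    if match is None:
        raise ValueError(f"unrecognised BigVGAN checkpoint name: {name!r}")
    khz, bands, upsample = (int(g) for g in match.groups())
    return {
        "sample_rate": NOMINAL_SAMPLE_RATES.get(khz, khz * 1000),
        "n_mels": bands,          # number of mel bands the vocoder expects
        "hop_length": upsample,   # total upsampling factor == mel hop length
    }


if __name__ == "__main__":
    old = parse_bigvgan_name("nvidia/bigvgan_v2_22khz_80band_256x")
    new = parse_bigvgan_name("https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x")
    # Print only the settings that differ between the two vocoders.
    for key in old:
        if old[key] != new[key]:
            print(f"{key}: {old[key]} -> {new[key]}")
```

Every mel-related setting that differs between the two vocoders would need to change together, since the rest of the model has to produce mels in the format the vocoder was trained on.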
@Plachtaa and @sleepingcat4
to
and
to
I also tried changing it to
and
and
Then I tried fine-tuning
No matter which whisper-large model I use, the output file always starts out speaking gibberish. Is it because the dataset I gathered totals only 3 minutes and 20 seconds? For reference, source audio
@GUUser91 You must rerun pretraining on a large-scale dataset once you swap the encoder from whisper-small to whisper-large-v3.
@GUUser91 Definitely, you should gather more data. I trained on 2 hours and got awesome results!
@sleepingcat4 Did you fine-tune a model or did you train from scratch?
@GUUser91 I just fine-tuned.
@sleepingcat4 How many steps did you fine-tune the model for?
@GUUser91 400 steps.
I'm doing the same, but it doesn't detect any GPU to train on. Training is too slow on the CPU, even though I have a GPU (RTX 4060).
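One quick thing to check in a situation like this is whether the installed PyTorch build can see CUDA at all; a CPU-only wheel is a common cause, though not the only possible one. The snippet below is a minimal diagnostic sketch, not part of this repository, and `pick_device` is a hypothetical helper name.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and can see a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return "cpu"

    # torch.version.cuda is None for CPU-only builds of PyTorch.
    print(f"torch {torch.__version__}, built with CUDA: {torch.version.cuda}")
    if torch.cuda.is_available():
        print(f"GPU detected: {torch.cuda.get_device_name(0)}")
        return "cuda"
    print("No usable GPU detected; check the driver and the PyTorch build.")
    return "cpu"


if __name__ == "__main__":
    print("training would run on:", pick_device())
```

If the reported CUDA version is `None`, reinstalling PyTorch with a CUDA-enabled build is the usual next step.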
Hi @sleepingcat4, did you change whisper-small to whisper-large-v3 and fine-tune with only 4 hours of data? One more question: is the 4 hours the total for a single-speaker dataset or a multi-speaker dataset? Many thanks!
@leminhnguyen Yes, I trained on a multi-speaker dataset with almost 4 hours of data. My data was very high quality; it's an 11labs dataset that I had developed and open-sourced a few days earlier on HF under my lab, Sleeping AI.
I read the code and it is very clear. If I change whisper-small to whisper-large, which output dim should I change? @Plachtaa do you have any hints or directions?
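For context on the dimension in question: the Whisper encoder's hidden size (`d_model`) depends on the model size, 768 for whisper-small and 1280 for the whisper-large variants. The sketch below is a small lookup helper; the function name is hypothetical, and which field in this project's config must match that width is not shown here and has to be found in the repository itself.

```python
# Encoder hidden size (d_model) of the published Whisper model sizes.
WHISPER_D_MODEL = {
    "tiny": 384,
    "base": 512,
    "small": 768,
    "medium": 1024,
    "large": 1280,  # also applies to large-v2 and large-v3
}


def whisper_encoder_dim(model_name: str) -> int:
    """Map a name such as "openai/whisper-large-v3" to its encoder feature width."""
    # Keep only the size family: "large-v3" -> "large", "small.en" -> "small".
    size = model_name.split("whisper-")[-1].split("-")[0].split(".")[0]
    if size not in WHISPER_D_MODEL:
        raise ValueError(f"unknown Whisper size in {model_name!r}")
    return WHISPER_D_MODEL[size]


if __name__ == "__main__":
    for name in ("openai/whisper-small", "openai/whisper-large-v3"):
        print(name, "->", whisper_encoder_dim(name))
```

Any layer that consumes the encoder features would need its input width changed from 768 to 1280, and since those weights no longer match the released checkpoint, they have to be retrained.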