Ray OOM causes the process to be killed #429
Comments
Same issue.
I suspect this is a memory leak when saving checkpoints.
But I only saved the checkpoint once or twice during training, so if the leak were in checkpoint saving, memory usage shouldn't increase linearly before any save happens.
Then the reward function you use might have a memory leak. Try switching to a dummy reward function to see whether the leak persists.
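For reference, a "dummy" reward here just means a stateless function that returns a constant, so it cannot accumulate memory between steps. Below is a minimal sketch assuming verl's custom reward-function interface (a callable taking the data source, the generated solution string, the ground truth, and optional extra info, and returning a float); the exact signature may differ across verl versions.

```python
# Minimal dummy reward sketch (assumed signature -- adjust to the custom
# reward-function interface of your verl version). It keeps no state and
# allocates nothing, so if memory still grows with this reward in place,
# the leak is somewhere else in the pipeline.
def dummy_compute_score(data_source, solution_str, ground_truth, extra_info=None):
    return 0.0
```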
@PeterSH6 would you mind adding the label?
@PKU-Fgx, this could be either a veRL issue, a Ray issue, or both. You can use jemalloc to profile it (see ray-project/ray#51031). If it turns out to be a Ray issue after you profile it, I'll take a look.
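For context, jemalloc heap profiling generally works by preloading the library and setting MALLOC_CONF before the target process starts. A rough sketch of how one might launch training under jemalloc is below; the library path, the MALLOC_CONF values, and the training entry point are all assumptions on my part (and jemalloc must be built with profiling enabled), so follow the procedure in the linked Ray issue for the exact steps.

```python
# Sketch only: start the training job with jemalloc preloaded so its heap
# profiler can write periodic dumps. The jemalloc path, MALLOC_CONF values,
# and the training command below are placeholders -- adapt them to your
# environment and to the steps described in ray-project/ray#51031.
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"          # profiling-enabled jemalloc build
env["MALLOC_CONF"] = "prof:true,lg_prof_interval:30,prof_prefix:jeprof"   # dump roughly every 2^30 bytes allocated

# Placeholder entry point; replace with your actual verl/Ray launch command.
subprocess.run(["python3", "-m", "verl.trainer.main_ppo"], env=env, check=True)
```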
Okay, I'll look into that in the next few days.
Did you use vLLM 0.7+? It seems that recent versions of vLLM are the cause of the memory leak. You can try vLLM 0.6.3.
Same issue with vLLM 0.7.3; I didn't save any checkpoints.
Same issue on a 32B model with 16 nodes, using vLLM 0.6.3.
In my case, I set …
It does not work well for me; my version is 0.7.2.
It does not work well for me; my version is 0.6.3.
@kevin85421 Hi! I've generated some … I apologize if these questions seem basic; any insights you could offer would be incredibly helpful. Thank you so much for your time and expertise!
Hi, I followed the PR, and here is what I got in the middle of training, where the leak happens:
Is the file I got correct? All the memory usage is relatively small.
@PKU-Fgx, maybe you can profile the Ray core worker processes and compare different memory dumps to see how much memory they contribute.
@wzq016, you profiled the GCS process. You can profile the core worker processes instead.
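As a quick alternative to full heap dumps, one can also just watch the resident set size (RSS) of every Ray-related process between training steps and see which one keeps growing. The following is only a rough psutil-based sketch of that idea (my own illustration, not something anyone in this thread reported running); the process-name matching is a heuristic.

```python
# Rough sketch: periodically log the RSS of Ray-related processes on this node
# so that a steadily growing core worker (or GCS/raylet) stands out over time.
# The name/cmdline matching below is heuristic -- adjust it to your setup.
import time
import psutil

def snapshot_ray_rss():
    """Return {pid: (name, rss_bytes)} for processes that look Ray-related."""
    procs = {}
    for p in psutil.process_iter(["pid", "name", "cmdline", "memory_info"]):
        try:
            name = p.info["name"] or ""
            cmd = " ".join(p.info["cmdline"] or [])
            if "ray::" in name or "ray::" in cmd or "raylet" in name or "gcs_server" in cmd:
                procs[p.info["pid"]] = (name, p.info["memory_info"].rss)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return procs

if __name__ == "__main__":
    while True:
        for pid, (name, rss) in sorted(snapshot_ray_rss().items()):
            print(f"{time.strftime('%H:%M:%S')} pid={pid} {name}: {rss / 1024**2:.1f} MiB")
        time.sleep(60)
```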
@kevin85421 I think I have discovered a clue indicating continuously increasing memory usage. I used …
@PKU-Fgx We have found that this issue is caused by vllm-project/vllm#14326, and this PR could solve the memory leak problem.
@hiyouga It works! So it's a vLLM problem, thanks a lot!
An example script to fix this problem:
export VLLM_COMMIT=227578480d71fc94ef46ca77fb69496412158d68
sudo pip install vllm --pre --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
git clone -b verl_v1 https://github.com/hiyouga/vllm.git
sudo cp -r vllm/vllm/ /usr/local/lib/python3.10/dist-packages/
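After installing the patched build, a quick way to confirm which vLLM the interpreter actually picks up (just a sanity check I'd add, not part of the original instructions):

```python
# Quick sanity check that the patched vLLM build is the one being imported.
import vllm
print(vllm.__version__)  # should reflect the nightly/patched build
print(vllm.__file__)     # should point at the package location you copied to
```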
I found that as training progressed, the System Memory Utilization (%) skyrocketed, and after a certain point Ray would report an out-of-memory error that crashed the training process.
Or is there some parameter I haven't configured correctly that is causing memory usage to keep increasing?