[Bug]: TBE Subprocess Task Distribute Failure When TP>1 #198
Could you please provide the
The full content of the script is:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tensor_parallel_size=2,
)
```

The script has only one function: loading the model. This issue was discovered in another project where, when

This is the image that was pulled.
I know the reason now. It's not that the model failed to load, but that it failed to shut down cleanly on exit. Adding code that manually cleans up these objects, following the sample, resolves this issue. The cleanup example is:

```python
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    # Tear down the distributed process groups and release NPU memory.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
          tensor_parallel_size=2,
          distributed_executor_backend="mp",
          max_model_len=26240)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Drop the engine reference before tearing down the distributed state.
del llm
clean_up()
```
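A side note (not from the thread): since the failure mode here is an unclean exit, one way to make sure the teardown above runs even when generation raises is to wrap the engine's lifetime in a context manager. Below is a minimal, self-contained sketch of the pattern; `managed_llm` is a hypothetical helper, and the lambdas are stand-ins for the real `LLM(...)` constructor and the `clean_up()` function defined above.

```python
import contextlib


@contextlib.contextmanager
def managed_llm(factory, clean_up):
    """Build an engine via `factory`; guarantee `clean_up` runs on exit.

    `factory` and `clean_up` are placeholders: in practice, pass
    `lambda: LLM(..., tensor_parallel_size=2)` and the `clean_up`
    function from the example above.
    """
    llm = factory()
    try:
        yield llm
    finally:
        # Runs on normal exit *and* when the body raises.
        del llm
        clean_up()


# Usage with dummy stand-ins so the sketch is runnable anywhere:
events = []
with managed_llm(lambda: object(), lambda: events.append("cleaned")) as llm:
    events.append("generated")

print(events)  # → ['generated', 'cleaned']
```

The same guarantee could be had with a plain `try`/`finally`; the context manager just keeps the setup and teardown in one place.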
@shen-shanshan is this problem covered in the FAQ? If not, please add an entry, thanks.
ok 👌 |
Your current environment
The output of `npu-smi info`
The output of `cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info`
The output of `python collect_env.py`
🐛 Describe the bug
When the tensor parallelism size is greater than 1, that is, when the parameter `tensor_parallel_size` is set to a value greater than 1, the error "TBE Subprocess[task_distribute] raise error[], main process disappeared!" is reported.