Error when running DeepSeek-R1-Distill-Qwen-32B on four 910A cards #1983

Open
RongRongStudio opened this issue Mar 10, 2025 · 0 comments

Labels
bug Something isn't working

Comments

@RongRongStudio
Describe the bug / 问题描述 (Mandatory / 必填)
Running DeepSeek-R1-Distill-Qwen-32B on four Ascend 910A cards fails with an error during generation.

  • Hardware Environment (Ascend/GPU/CPU) / 硬件环境:
    /device ascend
    Ascend 910A 32G

  • Software Environment / 软件环境 (Mandatory / 必填):
    -- Ascend HDK: 24.1.RC3
    -- CANN: 8.0.0
    -- MindSpore version: 2.5.0
    -- Python version: Python 3.10
    -- OS platform and distribution: Linux
  • Execute Mode / 执行模式 (Mandatory / 必填) (PyNative/Graph):
    /mode pynative
    /mode graph

Run by following the llm/inference/llama3 example.
To Reproduce / 重现步骤 (Mandatory / 必填)
Steps to reproduce the behavior:

  1. Adapt llm/inference/llama3/run_llama3_distributed.py to load DeepSeek-R1-Distill-Qwen-32B (a rough sketch of such a script is shown below).
  2. Launch it on four cards:
     msrun --worker_num=4 --local_worker_num=4 --master_port=8118 --join=True --bind_core=True run_llama3_distributed.py
  3. See the error below.
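
For context, here is a minimal sketch of the kind of script being launched. It follows the transformers-style API that mindnlp mirrors; the model id, prompt, dtype, and generation options are illustrative assumptions, not the exact contents of run_llama3_distributed.py, and the parallel/sharding setup of the real script is omitted.

```python
# Illustrative sketch only; the actual run_llama3_distributed.py may differ.
import mindspore
from mindspore.communication import init
from mindnlp.transformers import AutoModelForCausalLM, AutoTokenizer

init()  # join the cluster started by msrun (this call appears in the traceback)

# Model id and prompt are assumptions for illustration.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, ms_dtype=mindspore.float16)

inputs = tokenizer("Hello, who are you?", return_tensors="ms")
outputs = model.generate(**inputs, max_new_tokens=64)  # the TypeError below is raised inside generate()
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```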
Expected behavior / 预期结果 (Mandatory / 必填)
Distributed inference runs to completion on all four 910A cards and model.generate returns output without raising an exception.

Screenshots / 日志 / 截图 (Mandatory / 必填)
Console output from msrun, followed by scheduler and worker log excerpts:
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:16<00:49, 16.39s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:17<00:52, 17.51s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:25<00:24, 12.13s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:32<00:32, 16.43s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
Traceback (most recent call last):
File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
outputs = model.generate(
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
self._prepare_special_tokens(generation_config, kwargs_has_attention_mask)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1609, in _prepare_special_tokens
eos_token_tensor = _tensor_or_none(generation_config.eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1606, in _tensor_or_none
return mindspore.tensor(token, dtype=mindspore.int64)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
_check_input_data_type(input_data)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
raise TypeError(
TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:40<00:12, 12.50s/it]low_cpu_mem usage is not avaliable.
[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:43<00:14, 14.47s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
Traceback (most recent call last):
File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
outputs = model.generate(
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
self._prepare_special_tokens(generation_config, kwargs_has_attention_mask)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1609, in _prepare_special_tokens
eos_token_tensor = _tensor_or_none(generation_config.eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1606, in _tensor_or_none
return mindspore.tensor(token, dtype=mindspore.int64)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
_check_input_data_type(input_data)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
raise TypeError(
TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:312] Worker process 162172 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:318] There's worker exits with exception, kill all other workers.
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.720.000 [mindspore/parallel/cluster/process_entity/_api.py:331] Scheduler process 162146 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.721.000 [mindspore/parallel/cluster/process_entity/_api.py:334] Analyzing exception log...
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.722.000 [mindspore/parallel/cluster/process_entity/_api.py:431] Time out nodes are ['3']
scheduler.log-58-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:20.507.092 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-59-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.237 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-60-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.316 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-61-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.452 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-62-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.508 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log:63:[ERROR] DISTRIBUTED(162146,ffff2b7ef120,python):2025-03-10-09:40:31.015.657 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 3 is timed out. It may exit with exception, please check this node's log.
scheduler.log:64:[ERROR] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:35.507.622 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
scheduler.log:65:Traceback (most recent call last):
scheduler.log-66- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 7, in <module>
scheduler.log-67- init()
scheduler.log-68- File "/usr/local/lib/python3.10/dist-packages/mindspore/communication/management.py", line 198, in init
scheduler.log-69- init_cluster()
scheduler.log:70:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{3}, worker 3 is the first one timed out, please check its log.
scheduler.log-71-
scheduler.log-72-----------------------------------------------------
scheduler.log-73-- C++ Call Stack: (For framework developers)
scheduler.log-74-----------------------------------------------------
scheduler.log-75-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState

worker_0.log-44-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
worker_0.log:49:Traceback (most recent call last):
worker_0.log-50- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in <module>
worker_0.log-51- outputs = model.generate(
worker_0.log-52- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_0.log-53- return func(*args, **kwds)
worker_0.log-54- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate

worker_0.log-60- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_0.log-61- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_0.log-62- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in __init__
worker_0.log-63- _check_input_data_type(input_data)
worker_0.log-64- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_0.log:65: raise TypeError(
worker_0.log:66:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.

worker_3.log-43-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:16<00:32, 16.07s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
worker_3.log:48:Traceback (most recent call last):
worker_3.log-49- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in <module>
worker_3.log-50- outputs = model.generate(
worker_3.log-51- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_3.log-52- return func(*args, **kwds)
worker_3.log-53- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate

worker_3.log-59- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_3.log-60- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_3.log-61- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in __init__
worker_3.log-62- _check_input_data_type(input_data)
worker_3.log-63- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_3.log:64: raise TypeError(
worker_3.log:65:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
worker_3.log-66-[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
worker_3.log-67-[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
Traceback (most recent call last):
File "/usr/local/bin/msrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 150, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 144, in run
process_manager.run()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 225, in run
self.join_processes()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 336, in join_processes
raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: .

Additional context / 备注 (Optional / 选填)
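The direct failure is that generation_config.eos_token_id ends up as [151643, None] when _prepare_special_tokens converts it to a tensor, and mindspore.tensor cannot build a tensor from a list containing None; the scheduler timeout on node 3 and the final msrun RuntimeError look like downstream effects of the worker crash. As a hedged workaround sketch (not a verified fix; it assumes the transformers-style generation_config attributes that mindnlp mirrors), the None entry could be dropped before calling generate:

```python
# Hedged workaround sketch: strip None from eos_token_id before generate().
# 151643 is the eos id visible in the traceback; whether this resolves the
# underlying issue in mindnlp has not been verified here.
eos_ids = model.generation_config.eos_token_id
if isinstance(eos_ids, (list, tuple)):
    eos_ids = [tok for tok in eos_ids if tok is not None]
    model.generation_config.eos_token_id = eos_ids

outputs = model.generate(
    **inputs,
    max_new_tokens=64,       # illustrative value
    eos_token_id=eos_ids,    # pass the cleaned ids explicitly as well
)
```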

@RongRongStudio RongRongStudio added the bug Something isn't working label Mar 10, 2025