Describe the bug / 问题描述 (Mandatory / 必填)

Running DeepSeek-R1-Distill-Qwen-32B on four Ascend 910A cards with the llm/inference/llama3 distributed example fails: every worker raises a TypeError inside generate(), and the msrun job is then torn down.

Hardware Environment (Ascend/GPU/CPU) / 硬件环境: Ascend (910A, 4 cards)

Software Environment / 软件环境:
-- MindSpore 2.5.0
-- CANN 8.0.0
-- Ascend HDK 24.1.RC3

Execution Mode / 执行模式 (PyNative/Graph): PyNative / Graph

To Reproduce / 重现步骤 (Mandatory / 必填)
Steps to reproduce the behavior:
1. Follow the llm/inference/llama3 example in the mindnlp repository, substituting the DeepSeek-R1-Distill-Qwen-32B checkpoint.
2. Launch run_llama3_distributed.py on four Ascend 910A devices with msrun:
msrun --worker_num=4 --local_worker_num=4 --master_port=8118 --join=True --bind_core=True run_llama3_distributed.py
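For reference, a rough sketch of what the launched script does. This is not the actual run_llama3_distributed.py: only init() (traceback line 7) and model.generate() (traceback line 32) are visible in the logs below; the checkpoint id, tokenizer calls, and generation arguments are assumed placeholders, and any model-parallel sharding setup is omitted.

```python
# Hypothetical sketch, not the actual run_llama3_distributed.py.
# Only init() and model.generate() are confirmed by the tracebacks;
# everything else is an assumed placeholder.
import mindspore
from mindspore.communication import init
from mindnlp.transformers import AutoModelForCausalLM, AutoTokenizer

init()  # join the cluster started by msrun (4 workers + scheduler)

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # loads the sharded checkpoint

inputs = tokenizer("Hello", return_tensors="ms")
outputs = model.generate(**inputs, max_new_tokens=32)  # crashes in _prepare_special_tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```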
Expected behavior / 预期结果 (Mandatory / 必填)

The script should load the checkpoint shards on all four workers and generate text without raising an exception.
Screenshots/ 日志 / 截图 (Mandatory / 必填)
Console output from the msrun launch, followed by excerpts from scheduler.log, worker_0.log and worker_3.log:
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:16<00:49, 16.39s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:17<00:52, 17.51s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:25<00:24, 12.13s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:32<00:32, 16.43s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
Traceback (most recent call last):
File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
outputs = model.generate(
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
self._prepare_special_tokens(generation_config, kwargs_has_attention_mask)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1609, in _prepare_special_tokens
eos_token_tensor = _tensor_or_none(generation_config.eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1606, in _tensor_or_none
return mindspore.tensor(token, dtype=mindspore.int64)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
_check_input_data_type(input_data)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
raise TypeError(
TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:40<00:12, 12.50s/it]low_cpu_mem usage is not avaliable.
[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:43<00:14, 14.47s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
Traceback (most recent call last):
File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
outputs = model.generate(
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
self._prepare_special_tokens(generation_config, kwargs_has_attention_mask)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1609, in _prepare_special_tokens
eos_token_tensor = _tensor_or_none(generation_config.eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1606, in _tensor_or_none
return mindspore.tensor(token, dtype=mindspore.int64)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
_check_input_data_type(input_data)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
raise TypeError(
TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:312] Worker process 162172 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:318] There's worker exits with exception, kill all other workers.
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.720.000 [mindspore/parallel/cluster/process_entity/_api.py:331] Scheduler process 162146 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.721.000 [mindspore/parallel/cluster/process_entity/_api.py:334] Analyzing exception log...
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.722.000 [mindspore/parallel/cluster/process_entity/_api.py:431] Time out nodes are ['3']
scheduler.log-58-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:20.507.092 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-59-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.237 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-60-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.316 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-61-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.452 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-62-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.508 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log:63:[ERROR] DISTRIBUTED(162146,ffff2b7ef120,python):2025-03-10-09:40:31.015.657 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 3 is timed out. It may exit with exception, please check this node's log.
scheduler.log:64:[ERROR] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:35.507.622 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
scheduler.log:65:Traceback (most recent call last):
scheduler.log-66- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 7, in
scheduler.log-67- init()
scheduler.log-68- File "/usr/local/lib/python3.10/dist-packages/mindspore/communication/management.py", line 198, in init
scheduler.log-69- init_cluster()
scheduler.log:70:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{3}, worker 3 is the first one timed out, please check its log.
scheduler.log-71-
scheduler.log-72-----------------------------------------------------
scheduler.log-73-- C++ Call Stack: (For framework developers)
scheduler.log-74-----------------------------------------------------
scheduler.log-75-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
worker_0.log-44-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
worker_0.log:49:Traceback (most recent call last):
worker_0.log-50- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
worker_0.log-51- outputs = model.generate(
worker_0.log-52- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_0.log-53- return func(*args, **kwds)
worker_0.log-54- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
worker_0.log-60- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_0.log-61- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_0.log-62- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
worker_0.log-63- _check_input_data_type(input_data)
worker_0.log-64- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_0.log:65: raise TypeError(
worker_0.log:66:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
worker_3.log-43-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:16<00:32, 16.07s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
worker_3.log:48:Traceback (most recent call last):
worker_3.log-49- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
worker_3.log-50- outputs = model.generate(
worker_3.log-51- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_3.log-52- return func(*args, **kwds)
worker_3.log-53- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
worker_3.log-59- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_3.log-60- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_3.log-61- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
worker_3.log-62- _check_input_data_type(input_data)
worker_3.log-63- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_3.log:64: raise TypeError(
worker_3.log:65:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
worker_3.log-66-[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
worker_3.log-67-[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
Traceback (most recent call last):
File "/usr/local/bin/msrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 150, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 144, in run
process_manager.run()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 225, in run
self.join_processes()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 336, in join_processes
raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: .
Additional context / 备注 (Optional / 选填)
All workers crash at the same place: in _prepare_special_tokens, generation_config.eos_token_id resolves to [151643, None], and mindspore.tensor(token, dtype=mindspore.int64) rejects a list that contains None. Worker 3's crash then causes the scheduler to time it out and tear down the whole msrun job.
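A possible workaround (untested sketch, reusing the model/tokenizer/inputs names from the sketch above; the attributes follow the Hugging Face-style GenerationConfig that mindnlp mirrors) is to strip the None entry from eos_token_id before calling generate():

```python
# Untested workaround sketch: generation_config.eos_token_id resolves to
# [151643, None], and mindspore.tensor(..., dtype=mindspore.int64) rejects
# the None entry, so drop it before generate() is called.
gen_cfg = model.generation_config
if isinstance(gen_cfg.eos_token_id, (list, tuple)):
    cleaned = [t for t in gen_cfg.eos_token_id if t is not None]
    gen_cfg.eos_token_id = cleaned[0] if len(cleaned) == 1 else cleaned

# pad_token_id may also be unset for this checkpoint; fall back to eos (assumption).
if gen_cfg.pad_token_id is None:
    gen_cfg.pad_token_id = tokenizer.eos_token_id

outputs = model.generate(**inputs, max_new_tokens=32)
```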