🚀 The feature, motivation and pitch
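This currently fails: mounting the device of a single card into two containers produces the error below.
Dedicating a whole 910B to a single embedding model is a bit wasteful.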
[EVENT] PROFILING(618,python3):2025-03-06-01:09:06.978.203 [msprof_callback_impl.cpp:336] >>> (tid:618) Started to register profiling ctrl callback.
[EVENT] PROFILING(618,python3):2025-03-06-01:09:06.978.556 [msprof_callback_impl.cpp:343] >>> (tid:618) Started to register profiling hash id callback.
[INFO] PROFILING(618,python3):2025-03-06-01:09:06.978.670 [prof_atls_plugin.cpp:117] (tid:618) RegisterProfileCallback, callback type is 7
[EVENT] PROFILING(618,python3):2025-03-06-01:09:06.978.759 [msprof_callback_impl.cpp:350] >>> (tid:618) Started to register profiling enable host freq callback.
[INFO] PROFILING(618,python3):2025-03-06-01:09:06.978.842 [prof_atls_plugin.cpp:117] (tid:618) RegisterProfileCallback, callback type is 8
[ERROR] ATRACE(618,python3):2025-03-06-01:09:06.981.451 trace_driver_api.c:56 get platform info failed, drvErr=87.
[INFO] RUNTIME(618,python3):2025-03-06-01:09:06.985.224 [task_fail_callback_manager.cc:52] 618 TaskFailCallBackManager: Constructor.
[INFO] HCCL(618,python3):2025-03-06-01:09:07.071.734 [adapter_rts.cc:2646][618][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1]
[ERROR] RUNTIME(618,python3):2025-03-06-01:09:07.260.779 [runtime.cc:1879]618 CheckHaveDevice:Call halGetDeviceInfo failed: drvRet=87, module type=0, info type=1.
[INFO] PROFILING(618,python3):2025-03-06-01:09:07.261.136 [prof_atls_plugin.cpp:210] (tid:618) Module[7] register callback of ctrl handle.
[ERROR] RUNTIME(618,python3):2025-03-06-01:09:07.262.154 [driver.cc:65]618 GetDeviceCount:Call drvGetDevNum, drvRetCode=87.
[ERROR] RUNTIME(618,python3):2025-03-06-01:09:07.262.277 [api_c_device.cc:23]618 rtGetDeviceCount:ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020010
[ERROR] RUNTIME(618,python3):2025-03-06-01:09:07.262.373 [error_message_manage.cc:53]618 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(618,python3):2025-03-06-01:09:07.262.459 [error_message_manage.cc:53]618 FuncErrorReason:rtGetDeviceCount execute failed, reason=[driver error:internal error]
[ERROR] ASCENDCL(618,python3):2025-03-06-01:09:07.262.594 [device.cpp:342]618 aclrtGetDeviceCount: get device count failed, runtime result = 507899.
[ERROR] APP(618,python3):2025-03-06-01:09:07.262.746 [log_inner.cpp:76]618 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:device_count:25: "[PTA]:"get device count of NPU failed""
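To make the failure chain in the log above easier to read, the ERROR entries can be pulled out with a short script. This is only a sketch: the regular expressions are inferred from the log lines in this issue rather than from any documented log format, and `parse_errors` is a hypothetical helper name, not part of torch_npu or CANN.

```python
import re

# Line layout inferred from this issue's log (an assumption, not a documented format):
# [LEVEL] MODULE(pid,process):timestamp message
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+(?P<module>\w+)\((?P<pid>\d+),(?P<proc>[^)]+)\):"
    r"(?P<ts>\S+)\s+(?P<msg>.*)$"
)
# Driver return codes show up as drvErr=87, drvRet=87 or drvRetCode=87 in this log.
DRV_CODE = re.compile(r"drv(?:Err|Ret|RetCode)=(\d+)")


def parse_errors(log_text):
    """Return (module, driver_code_or_None, message) for every ERROR line."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None or match.group("level") != "ERROR":
            continue
        code = DRV_CODE.search(match.group("msg"))
        errors.append(
            (match.group("module"), int(code.group(1)) if code else None, match.group("msg"))
        )
    return errors


# A few lines copied from the log in this issue.
SAMPLE = """\
[INFO] PROFILING(618,python3):2025-03-06-01:09:06.978.842 [prof_atls_plugin.cpp:117] (tid:618) RegisterProfileCallback, callback type is 8
[ERROR] ATRACE(618,python3):2025-03-06-01:09:06.981.451 trace_driver_api.c:56 get platform info failed, drvErr=87.
[ERROR] RUNTIME(618,python3):2025-03-06-01:09:07.262.154 [driver.cc:65]618 GetDeviceCount:Call drvGetDevNum, drvRetCode=87.
[ERROR] ASCENDCL(618,python3):2025-03-06-01:09:07.262.594 [device.cpp:342]618 aclrtGetDeviceCount: get device count failed, runtime result = 507899.
"""

if __name__ == "__main__":
    for module, code, msg in parse_errors(SAMPLE):
        print(module, code, msg)
```

Run against the full log, this shows that the ATRACE and RUNTIME driver calls all return 87, and the later ASCENDCL and torch_npu errors appear to follow from those driver failures.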
Alternatives
No response
Additional context
No response