[Bug]: Memory Leak or Abnormal Memory Increase When Deploying Fine-Tuned Qwen2VL-72B Model with vLLM Serve #216

Open
XuyaoWang opened this issue Mar 2, 2025 · 0 comments
Labels
bug Something isn't working

Your current environment

The output of `npu-smi info`
root@1c518a2e9ee2:/workspace# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.7                   Version: 23.0.7                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 102.8       55                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3338 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 97.6        54                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3337 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 96.5        56                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3334 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 108.7       57                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3334 / 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 103.9       58                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          3334 / 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 99.3        56                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          3334 / 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 107.7       58                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3334 / 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 105.1       59                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          3338 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
The output of `cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info`
root@1c518a2e9ee2:/workspace# cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
package_name=Ascend-cann-toolkit
version=8.0.0
innerversion=V100R001C20SPC001B251
compatible_version=[V100R001C15],[V100R001C17],[V100R001C18],[V100R001C19],[V100R001C20]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux
The output of `python collect_env.py`
root@1c518a2e9ee2:/workspace# python collect_env.py
INFO 02-27 17:19:50 __init__.py:28] Available plugins for group vllm.platform_plugins:
INFO 02-27 17:19:50 __init__.py:30] name=ascend, value=vllm_ascend:register
INFO 02-27 17:19:50 __init__.py:32] all available plugins for group vllm.platform_plugins will be loaded.
INFO 02-27 17:19:50 __init__.py:34] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 02-27 17:19:50 __init__.py:42] plugin ascend loaded.
INFO 02-27 17:19:51 __init__.py:28] Available plugins for group vllm.platform_plugins:
INFO 02-27 17:19:51 __init__.py:30] name=ascend, value=vllm_ascend:register
INFO 02-27 17:19:51 __init__.py:32] all available plugins for group vllm.platform_plugins will be loaded.
INFO 02-27 17:19:51 __init__.py:34] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 02-27 17:19:51 __init__.py:42] plugin ascend loaded.
INFO 02-27 17:19:51 __init__.py:187] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 02-27 17:19:51 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 02-27 17:19:51 __init__.py:174] Platform plugin ascend is activated
Collecting environment information...
PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.15 (main, Nov 27 2024, 06:51:55) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.0-60.139.0.166.oe2203.aarch64-aarch64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       aarch64
CPU op-mode(s):                     64-bit
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          HiSilicon
Model name:                         Kunpeng-920
Model:                              0
Thread(s) per core:                 1
Core(s) per cluster:                48
Socket(s):                          -
Cluster(s):                         4
Stepping:                           0x1
Frequency boost:                    disabled
CPU max MHz:                        2600.0000
CPU min MHz:                        200.0000
BogoMIPS:                           200.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                          12 MiB (192 instances)
L1i cache:                          12 MiB (192 instances)
L2 cache:                           96 MiB (192 instances)
L3 cache:                           192 MiB (8 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-23
NUMA node1 CPU(s):                  24-47
NUMA node2 CPU(s):                  48-71
NUMA node3 CPU(s):                  72-95
NUMA node4 CPU(s):                  96-119
NUMA node5 CPU(s):                  120-143
NUMA node6 CPU(s):                  144-167
NUMA node7 CPU(s):                  168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1.dev20250218
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

LD_LIBRARY_PATH=/usr/local/python3.10/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1

🐛 Describe the bug

When the fine-tuned Qwen2VL-72B model is deployed with `vllm serve`, memory usage keeps increasing abnormally after the model has loaded. Once it has consumed all the memory on the server, the process throws an error and exits.

The original serving script is:

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
source /usr/local/Ascend/ascend-toolkit/set_env.sh

vllm serve <path_to_finetuned_qwen2vl-72b> \
--served-model-name finetuned_qwen2vl-72b \
--tensor-parallel-size 8 \
--distributed_executor_backend "mp"

When the program hangs, the loading output, memory usage, and NPU memory usage are as follows:

Image
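
A screenshot cannot be searched or diffed, so a text record of device memory may be easier to compare across runs. The following is a minimal sketch, not part of the original report, that extracts the HBM-Usage column from `npu-smi info` output in the layout shown earlier in this issue. The function name `parse_hbm_usage` is hypothetical, and the regular expression assumes the table layout printed by npu-smi 23.0.7; other versions may format rows differently.

```python
import re

# Second row of each device entry, recognised by its PCI bus id, e.g.
# "| 0        | 0000:C1:00.0  | 0    0 / 0    3338 / 65536 |".
# Assumption: layout as printed by npu-smi 23.0.7 in this issue.
_CHIP_ROW = re.compile(
    r"\|\s*\d+\s*\|\s*"
    r"[0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]"
    r"\s*\|(.*)\|"
)
_USED_TOTAL = re.compile(r"(\d+)\s*/\s*(\d+)")


def parse_hbm_usage(npu_smi_text):
    """Return [(used_mb, total_mb), ...], one tuple per chip row, in table order."""
    usage = []
    for line in npu_smi_text.splitlines():
        row = _CHIP_ROW.search(line)
        if row is None:
            continue  # header, separator, name row, or process-table row
        pairs = _USED_TOTAL.findall(row.group(1))
        if pairs:
            # The last "used / total" pair on the row is the HBM-Usage column.
            used, total = pairs[-1]
            usage.append((int(used), int(total)))
    return usage


SAMPLE = """\
| 0     910B3               | OK            | 102.8       55                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3338 / 65536         |
"""

if __name__ == "__main__":
    print(parse_hbm_usage(SAMPLE))  # → [(3338, 65536)]
```

Running this on `npu-smi info` output captured at intervals while the server loads would show whether HBM usage stays flat while memory on the server keeps growing.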

The error message when the program crashes:

error_message.log

The debugging script is:

export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
source /usr/local/Ascend/ascend-toolkit/set_env.sh

vllm serve <path_to_finetuned_qwen2vl-72b> \
--served-model-name finetuned_qwen2vl-72b \
--tensor-parallel-size 8 \
--distributed_executor_backend "mp"

The detailed log is:

ascend_detailed_log.log
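
To put numbers on the memory growth described above, the resident memory of each serving process could be sampled over time. This is a minimal sketch under stated assumptions, not part of the original report: it assumes a Linux host with `/proc` mounted, and the helper names `parse_vmrss_kib` and `sample_rss` are hypothetical. With `--tensor-parallel-size 8` and the `mp` backend there are several worker processes, so each PID would need to be sampled separately.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kib(status_text):
    """Extract VmRSS (resident set size, in KiB) from /proc/<pid>/status text.

    Returns None when the field is absent (for example, kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def sample_rss(pid, interval_s=5.0, samples=12):
    """Yield (elapsed_seconds, rss_kib) for one process at a fixed interval."""
    start = time.monotonic()
    for _ in range(samples):
        text = Path(f"/proc/{pid}/status").read_text()
        yield time.monotonic() - start, parse_vmrss_kib(text)
        time.sleep(interval_s)


if __name__ == "__main__":
    import sys

    # Usage: python sample_rss.py <pid>
    for elapsed, rss_kib in sample_rss(int(sys.argv[1])):
        print(f"{elapsed:8.1f}s  VmRSS={rss_kib} KiB")
```

A steadily rising VmRSS after the weights have finished loading would point to growth in the host process itself rather than in device memory.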

XuyaoWang added the bug label on Mar 2, 2025