The problem I encountered
After deploying Qwen2-VL-7B-Instruct-GPTQ-Int4 with vLLM, continuous client requests cause CPU memory usage to keep rising. Is some memory not being reclaimed?
My specific usage scenario is:
I have two GPUs. When I use the Ray framework for distributed deployment, CPU memory grows as more VL requests are processed, eventually causing Ray actors to crash.
I have tested the native (non-vLLM) loading method of Qwen2-VL-7B-Instruct-GPTQ-Int4 and it does not cause CPU memory growth. Once the model is loaded through the vLLM framework, CPU memory grows continuously.
[Special note]: When you test, be sure to change the image on every request so the CPU memory growth is clearly visible. If the same image is reused, it only leaks once, which makes the growth look inconspicuous.
My code and environment
Here is my code
import os

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

def getMessage(pic_file):
    messages = [
        {'role': 'system',
         'content': 'You are a very useful assistant, please strictly follow the requirements to complete the task!'},
        {'role': 'user',
         'content': [
             {'type': 'image_url', 'image_url': pic_file,
              'min_pixels': 50176, 'max_pixels': 1411200},
             {'type': 'text',
              'text': "Don't worry about the prompt words here, they are just examples"},
         ]},
    ]
    return messages

def vllm_extract_text(result_list, model_path, temperature, top_p, max_token, min_pixels, max_pixels):
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    model_path = "/mnt/data/programdata/vl_model/Qwen2-VL-7B-Instruct-GPTQ-Int4"
    llm = LLM(model=model_path, limit_mm_per_prompt={"image": 5, "video": 0})
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p,
                                     max_tokens=max_token, stop_token_ids=[])
    processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels)
    # Ignore result_list; it is the data returned from MongoDB (one image per doc)
    for doc in result_list:
        messages = getMessage(doc['pic'])
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, _ = process_vision_info(messages)
        mm_data = {}
        if image_inputs is not None:
            mm_data["image"] = image_inputs
        llm_inputs = {
            "prompt": text,
            "multi_modal_data": mm_data,
        }
        outputs = llm.generate([llm_inputs], sampling_params=sampling_params, use_tqdm=False)
        for output in outputs:
            generated_text = output.outputs[0].text
        del llm_inputs, outputs
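For reference, below is a minimal sketch of how the host-memory growth can be observed around a loop like this. It assumes psutil is installed; log_rss is a hypothetical helper that is not part of the original code, and it only measures the calling process, so with Ray or multiprocessing workers the growth may show up in child processes instead.

import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag: str) -> None:
    # Print the resident set size (CPU memory) of the current process in MiB.
    rss_mib = _proc.memory_info().rss / (1024 * 1024)
    print(f"[{tag}] RSS = {rss_mib:.1f} MiB")

# Usage inside the loop above, with a different image per iteration so the
# growth is clearly visible:
#   for i, doc in enumerate(result_list):
#       ...
#       outputs = llm.generate([llm_inputs], sampling_params=sampling_params, use_tqdm=False)
#       log_rss(f"after request {i}")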
This is the vLLM version information
Name: vllm
Version: 0.7.2
This is my GPU info
This is the memory leak information
@woshiwanlei1 Can you try specifying disable_mm_preprocessor_cache=True and see if the host memory overflow issue still persists?
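For reference, a minimal sketch of that suggestion applied to the constructor call in the code above (assuming the flag is forwarded to the engine arguments in this vLLM version):

# Disable the multimodal preprocessor cache, as suggested above.
llm = LLM(
    model=model_path,
    limit_mm_per_prompt={"image": 5, "video": 0},
    disable_mm_preprocessor_cache=True,
)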
I'll come back and give it a try, thank you.
I found out later that the memory growth doesn't seem to be unlimited: once it reaches a certain level, it stops growing. I also installed additional memory modules later, which solved the problem for me.