[New Model]: Qwen2-VL #246

Open · Yikun opened this issue Mar 5, 2025 · 1 comment
Yikun (Collaborator) commented Mar 5, 2025

The model to consider.

https://huggingface.co/Qwen/Qwen2-VL-2B
https://huggingface.co/Qwen/Qwen2-VL-7B

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

No response

Yikun added the new model label Mar 5, 2025
Yikun mentioned this issue Mar 5, 2025
Potabk (Contributor) commented Mar 6, 2025

In one of my tests I found that qwen2_vl already appears to be supported. I'm not sure whether this is a correct way to test it; the script is derived from the vLLM example:

python examples/offline_inference/vision_language.py 
INFO 03-06 11:04:11 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 03-06 11:04:11 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 03-06 11:04:11 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 03-06 11:04:11 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-06 11:04:11 [__init__.py:44] plugin ascend loaded.
INFO 03-06 11:04:11 [__init__.py:247] Platform plugin ascend is activated
INFO 03-06 11:04:26 [config.py:576] This model supports multiple tasks: {'classify', 'reward', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
INFO 03-06 11:04:26 [llm_engine.py:235] Initializing a V0 LLM engine (v0.7.3.dev245+gcd1f843f) with config: model='/root/wl/cache/modelscope/models/Qwen/Qwen2-VL-2B-Instruct', speculative_config=None, tokenizer='/root/wl/cache/modelscope/models/Qwen/Qwen2-VL-2B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/wl/cache/modelscope/models/Qwen/Qwen2-VL-2B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs={'min_pixels': 784, 'max_pixels': 1003520}, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[8,4,2,1],"max_capture_size":8}, use_cached_outputs=False, 
INFO 03-06 11:04:28 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 03-06 11:04:28 [utils.py:2300] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd0fd9bbe0>
INFO 03-06 11:04:36 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
WARNING 03-06 11:04:36 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 03-06 11:04:36 [config.py:3164] cudagraph sizes specified by model runner [1, 2, 4, 8] is overridden by config [8, 1, 2, 4]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.25it/s]

INFO 03-06 11:04:38 [loader.py:422] Loading weights took 1.12 seconds
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Keyword argument `min_pixels` is not a valid argument for this processor and will be ignored.
Keyword argument `max_pixels` is not a valid argument for this processor and will be ignored.
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
/root/wl/oldfiles/vllm-project/vllm/vllm/model_executor/models/qwen2_vl.py:626: UserWarning: current tensor is running as_strided, don't perform inplace operations on the returned value. If you encounter this warning and have precision issues, you can try torch.npu.config.allow_internal_format = False to resolve precision issues. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:123.)
  x = x.unsqueeze(1)
INFO 03-06 11:04:49 [executor_base.py:111] # npu blocks: 112772, # CPU blocks: 9362
INFO 03-06 11:04:49 [executor_base.py:116] Maximum concurrency for 4096 tokens per request: 440.52x
INFO 03-06 11:04:50 [llm_engine.py:441] init engine (profile, create kv cache, warmup model) took 11.37 seconds
WARNING 03-06 11:04:57 [utils.py:1485] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 03-06 11:04:57 [utils.py:1485] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 03-06 11:04:57 [utils.py:1485] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 03-06 11:04:57 [utils.py:1485] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.10s/it, est. speed input: 1158.04 toks/s, output: 54.45 toks/s]
The image shows a close-up view of cherry blossoms in full bloom, with the Tokyo Skytree tower visible through the branches. The sky is clear and blue, and the cherry blossoms are in shades of pink and white, creating a beautiful contrast against the blue sky.
The image shows a beautiful view of a cherry blossom tree with pink flowers in full bloom, set against a clear blue sky. In the background, there is a tall tower with a white and gray structure, possibly a telecommunications or observation tower. The cherry blossoms frame the tower, creating a picturesque and serene scene.
The image shows a close-up view of cherry blossoms in full bloom, with the Tokyo Skytree tower visible through the branches. The sky is clear and blue, and the cherry blossoms are in shades of pink and white. The Tokyo Skytree is a famous landmark in Tokyo, Japan, and is known for its
The image shows a close-up view of cherry blossoms in full bloom, with the Tokyo Skytree tower visible through the branches. The sky is clear and blue, and the cherry blossoms are in shades of pink and white, creating a beautiful contrast against the blue background.
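
For reference, a minimal standalone sketch of what the example script does for qwen2_vl. The model path is a placeholder, and the prompt template and pixel bounds mirror the vLLM example (28*28 = 784 and 1280*28*28 = 1003520 match the mm_processor_kwargs in the log above); this is an illustration, not the exact example code:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Placeholder model path; substitute your local checkpoint location.
# Note the log above shows min_pixels/max_pixels being dropped by the
# slow image processor, so these bounds may not take effect.
llm = LLM(
    model="Qwen/Qwen2-VL-2B-Instruct",
    max_model_len=4096,
    mm_processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 1280 * 28 * 28},
)

# Qwen2-VL chat template with a single image placeholder.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "What is the content of this image?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# vLLM's bundled "cherry_blossom" test image (the Tokyo Skytree photo
# described in the outputs above).
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)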
