
OpenVINO: added CPU-like conditions #14338

Open
wants to merge 1 commit into main

Conversation

ilya-lavrenov (Contributor) commented Mar 6, 2025

Otherwise, we get issues like the following:

E               INFO 03-06 06:47:45 [__init__.py:253] Automatically detected platform openvino.
E               
E               Running healthcheck for model: meta-llama/Llama-2-7b-chat-hf and modeling: optimum
E               INFO 03-06 06:47:46 [config.py:2545] Upcasting torch.float16 to torch.float32.
E               INFO 03-06 06:47:54 [config.py:576] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
E               WARNING 03-06 06:47:54 [openvino.py:87] CUDA graph is not supported on OpenVINO backend, fallback to the eager mode.
E               INFO 03-06 06:47:54 [openvino.py:121] OpenVINO CPU optimal block size is 32, overriding currently set 16
E               WARNING 03-06 06:47:54 [openvino.py:136] Environment variable VLLM_OPENVINO_KVCACHE_SPACE (GB) for OpenVINO backend is not set, using 4 by default.
E               INFO 03-06 06:47:54 [llm_engine.py:235] Initializing a V0 LLM engine (vdev) with config: model='meta-llama/Llama-2-7b-chat-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float32, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=<Type: 'float16'>,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-2-7b-chat-hf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
E               INFO 03-06 06:47:58 [openvino.py:38] Cannot use None backend on OpenVINO.
E               INFO 03-06 06:47:58 [openvino.py:39] Using OpenVINO Attention backend.
E               INFO 03-06 06:47:58 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
E               WARNING 03-06 06:47:58 [openvino.py:114] Provided model id meta-llama/Llama-2-7b-chat-hf does not contain OpenVINO IR, the model will be converted to IR with default options. If you need to use specific options for model conversion, use optimum-cli export openvino with desired options.
E               INFO 03-06 06:48:26 [executor_base.py:111] # openvino blocks: 256, # CPU blocks: 0
E               INFO 03-06 06:48:26 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 8.00x
E               INFO 03-06 06:48:27 [llm_engine.py:441] init engine (profile, create kv cache, warmup model) took 1.41 seconds
E               WARNING 03-06 06:48:29 [openvino.py:64] Pin memory is not supported on OpenViNO.
E               Error occurred in HF segment: Torch not compiled with CUDA enabled
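The failure comes from CUDA-only code paths (e.g. pinned memory) being exercised even though the OpenVINO backend runs on CPU. Below is a minimal sketch of the kind of guard the PR title describes, assuming the `current_platform.is_cpu()` / `is_openvino()` helpers from `vllm.platforms`; the `_is_cpu_like` and `resolve_pin_memory` helpers are hypothetical, and the actual one-commit diff may touch different call sites:

```python
# Hypothetical sketch, not the actual diff from this PR: extend the existing
# CPU-only guards so that the OpenVINO platform takes the same code path as
# the CPU platform and never reaches CUDA-only calls.
from vllm.platforms import current_platform  # helper methods assumed below


def _is_cpu_like() -> bool:
    # Treat OpenVINO like CPU: there is no CUDA device, so CUDA-only features
    # (CUDA graphs, pinned memory, torch.cuda.* calls) must be skipped.
    return current_platform.is_cpu() or current_platform.is_openvino()


def resolve_pin_memory(requested: bool) -> bool:
    # Pinned host memory requires a CUDA build of PyTorch; falling back here
    # avoids the "Torch not compiled with CUDA enabled" error from the log.
    return False if _is_cpu_like() else requested
```

The same `_is_cpu_like()` style of check would apply wherever vLLM currently branches only on `is_cpu()`.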

github-actions bot commented Mar 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of that by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀
