
OpenVINO: added CPU-like conditions #14338

Open
wants to merge 1 commit into main

Conversation

ilya-lavrenov (Contributor) commented Mar 6, 2025

Otherwise, we get issues like the following:

E               INFO 03-06 06:47:45 [__init__.py:253] Automatically detected platform openvino.
E               
E               Running healthcheck for model: meta-llama/Llama-2-7b-chat-hf and modeling: optimum
E               INFO 03-06 06:47:46 [config.py:2545] Upcasting torch.float16 to torch.float32.
E               INFO 03-06 06:47:54 [config.py:576] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
E               WARNING 03-06 06:47:54 [openvino.py:87] CUDA graph is not supported on OpenVINO backend, fallback to the eager mode.
E               INFO 03-06 06:47:54 [openvino.py:121] OpenVINO CPU optimal block size is 32, overriding currently set 16
E               WARNING 03-06 06:47:54 [openvino.py:136] Environment variable VLLM_OPENVINO_KVCACHE_SPACE (GB) for OpenVINO backend is not set, using 4 by default.
E               INFO 03-06 06:47:54 [llm_engine.py:235] Initializing a V0 LLM engine (vdev) with config: model='meta-llama/Llama-2-7b-chat-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float32, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=<Type: 'float16'>,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-2-7b-chat-hf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
E               INFO 03-06 06:47:58 [openvino.py:38] Cannot use None backend on OpenVINO.
E               INFO 03-06 06:47:58 [openvino.py:39] Using OpenVINO Attention backend.
E               INFO 03-06 06:47:58 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
E               WARNING 03-06 06:47:58 [openvino.py:114] Provided model id meta-llama/Llama-2-7b-chat-hf does not contain OpenVINO IR, the model will be converted to IR with default options. If you need to use specific options for model conversion, use optimum-cli export openvino with desired options.
E               INFO 03-06 06:48:26 [executor_base.py:111] # openvino blocks: 256, # CPU blocks: 0
E               INFO 03-06 06:48:26 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 8.00x
E               INFO 03-06 06:48:27 [llm_engine.py:441] init engine (profile, create kv cache, warmup model) took 1.41 seconds
E               WARNING 03-06 06:48:29 [openvino.py:64] Pin memory is not supported on OpenViNO.
E               Error occurred in HF segment: Torch not compiled with CUDA enabled
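The failure comes from CUDA-only code paths (e.g. pinned memory) being exercised even though the OpenVINO backend runs on CPU. Below is a minimal sketch of the kind of guard the PR title describes, assuming the `current_platform.is_cpu()` / `is_openvino()` helpers from `vllm.platforms`; the `_is_cpu_like` and `resolve_pin_memory` helpers are hypothetical, and the actual one-commit diff may touch different call sites:

```python
# Hypothetical sketch, not the actual diff from this PR: extend the existing
# CPU-only guards so that the OpenVINO platform takes the same code path as
# the CPU platform and never reaches CUDA-only calls.
from vllm.platforms import current_platform  # helper methods assumed below


def _is_cpu_like() -> bool:
    # Treat OpenVINO like CPU: there is no CUDA device, so CUDA-only features
    # (CUDA graphs, pinned memory, torch.cuda.* calls) must be skipped.
    return current_platform.is_cpu() or current_platform.is_openvino()


def resolve_pin_memory(requested: bool) -> bool:
    # Pinned host memory requires a CUDA build of PyTorch; falling back here
    # avoids the "Torch not compiled with CUDA enabled" error from the log.
    return False if _is_cpu_like() else requested
```

The same `_is_cpu_like()` style of check would apply wherever vLLM currently branches only on `is_cpu()`.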

github-actions bot commented Mar 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of that by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀
