[Installation]: Attempting to build and run vLLM for Intel Core Ultra 7 155H with ARC iGPU #14295

cgruver commented Mar 5, 2025

The build seems to complete with no issues.

The server runs with some odd stdout logging.

curl requests get a 200 response from the server, but the completion text returned is not valid (garbage output).

Working notes are here - https://github.com/cgruver/vllm-intel-gpu-workspace

Your current environment

python collect_env.py

/projects/home/.local/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'libjpeg.so.8: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[W305 15:15:04.737001522 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
INFO 03-05 15:15:05 [__init__.py:253] Automatically detected platform xpu.
Collecting environment information...
PyTorch version: 2.5.1+cxx11.abi
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Fedora Linux 41 (Container Image) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 3.30.8
Libc version: glibc-2.40

Python version: 3.12.9 (main, Feb  4 2025, 00:00:00) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)
Python platform: Linux-5.14.0-427.57.1.el9_4.x86_64-x86_64-with-glibc2.40
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               22
On-line CPU(s) list:                  0-21
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) Ultra 7 155H
CPU family:                           6
Model:                                170
Thread(s) per core:                   2
Core(s) per socket:                   16
Socket(s):                            1
Stepping:                             4
CPU(s) scaling MHz:                   86%
CPU max MHz:                          4700.0000
CPU min MHz:                          400.0000
BogoMIPS:                             5990.40
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid bus_lock_detect movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                       VT-x
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-21
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.5.10+xpu
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1+cxx11.abi
[pip3] torchaudio==2.5.1+cxx11.abi
[pip3] torchvision==0.20.1+cxx11.abi
[pip3] transformers==4.49.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.1.dev4910+g3610fb4.d20250305 (git sha: 3610fb4.d20250305)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

VLLM_TRACE_FUNCTION=1
LD_LIBRARY_PATH=/projects/home/.local/lib/python3.12/site-packages/cv2/../../lib64:/opt/intel/oneapi/tcm/1.2/lib:/opt/intel/oneapi/umf/0.9/lib:/opt/intel/oneapi/tbb/2022.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mkl/2025.0/lib:/opt/intel/oneapi/dnnl/2025.0/lib:/opt/intel/oneapi/debugger/2025.0/opt/debugger/lib:/opt/intel/oneapi/compiler/2025.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2025.0/lib:/checode/checode-linux-libc/ubi9/ld_libs:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1

How you are installing vllm

Build from source -

OS Fedora 41

HW - Core Ultra 7 155H

Installed Packages -

[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
dnf install -y g++ cmake git libcurl-devel intel-oneapi-mkl-sycl-devel intel-oneapi-dnnl-devel intel-oneapi-compiler-dpcpp-cpp intel-level-zero oneapi-level-zero oneapi-level-zero-devel intel-compute-runtime procps-ng python3.12 python3.12-devel lspci clinfo openssl libbrotli git tar gzip zip xz unzip which shadow-utils bash zsh vi wget jq podman buildah skopeo podman-docker ca-certificates fuse-overlayfs util-linux vim-minimal vim-enhanced awk libpng-devel libjpeg-devel libfabric-devel
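
For reference, a minimal sketch of how the repo definition above could be written to disk before running the dnf command (the /etc/yum.repos.d/oneAPI.repo path is an assumption; any .repo file in that directory works):

sudo tee /etc/yum.repos.d/oneAPI.repo << 'EOF'
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF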

Install XPU dependencies

python -m pip install --upgrade pip

python -m pip install torch==2.5.1+cxx11.abi torchvision==0.20.1+cxx11.abi torchaudio==2.5.1+cxx11.abi intel-extension-for-pytorch==2.5.10+xpu oneccl_bind_pt==2.5.0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
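
A quick way to confirm the XPU stack imported correctly after this step (a hedged sketch, not part of the original notes):

python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__); print('xpu available:', torch.xpu.is_available())"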

Fix requirements-xpu.txt

cat << EOF > requirements-xpu.txt
-r requirements-common.txt

ray>=2.9
cmake>=3.26
ninja
packaging
setuptools-scm>=8
setuptools>=75.8.0
wheel
jinja2
EOF

Fix apparent issue with args passed to torch.xpu.varlen_fwd #11173 (comment)

sed -i '326d' ${HOME}/.local/lib/python3.12/site-packages/intel_extension_for_pytorch/transformers/models/reference/fusions/mha_fusion.py
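
The sed above deletes line 326 of mha_fusion.py in place. A slightly safer variant (a sketch, assuming GNU sed) keeps a backup so the deleted line can be inspected or the change reverted:

MHA=${HOME}/.local/lib/python3.12/site-packages/intel_extension_for_pytorch/transformers/models/reference/fusions/mha_fusion.py
sed -i.bak '326d' ${MHA}
diff ${MHA}.bak ${MHA}   # shows the single line that was removed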

Build vLLM

VLLM_TARGET_DEVICE=xpu python -m pip install .
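
A quick post-build check (a sketch, not from the original notes) - importing vllm also logs the detected platform, which should report xpu as in the output below:

python -c "import vllm; print(vllm.__version__)"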

Run vLLM Server

vllm serve --host 0.0.0.0 --port 8080 --device xpu --gpu-memory-utilization 0.3 Qwen/Qwen2.5-1.5B-Instruct

STDOUT -

/projects/home/.local/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'libjpeg.so.8: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[W305 15:08:56.830644723 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
INFO 03-05 15:08:57 [__init__.py:253] Automatically detected platform xpu.
INFO 03-05 15:08:57 [api_server.py:912] vLLM API server version 0.1.dev4910+g3610fb4.d20250305
INFO 03-05 15:08:57 [api_server.py:913] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-1.5B-Instruct', config='', host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.3, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, 
calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f5d9c6496c0>)
INFO 03-05 15:08:57 [api_server.py:209] Started engine process with PID 96388
/projects/home/.local/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'libjpeg.so.8: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[W305 15:09:00.515285243 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
INFO 03-05 15:09:00 [__init__.py:253] Automatically detected platform xpu.
INFO 03-05 15:09:02 [config.py:576] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 03-05 15:09:02 [_logger.py:68] bfloat16 is only supported on Intel Data Center GPU, Intel Arc GPU is not supported yet. Your device is Intel(R) Arc(TM) Graphics, which is not supported. will fallback to float16
WARNING 03-05 15:09:02 [_logger.py:68] CUDA graph is not supported on XPU, fallback to the eager mode.
WARNING 03-05 15:09:02 [_logger.py:68] uni is not supported on XPU, fallback to ray distributed executor backend.
INFO 03-05 15:09:06 [config.py:576] This model supports multiple tasks: {'reward', 'classify', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 03-05 15:09:06 [_logger.py:68] bfloat16 is only supported on Intel Data Center GPU, Intel Arc GPU is not supported yet. Your device is Intel(R) Arc(TM) Graphics, which is not supported. will fallback to float16
WARNING 03-05 15:09:06 [_logger.py:68] CUDA graph is not supported on XPU, fallback to the eager mode.
WARNING 03-05 15:09:06 [_logger.py:68] uni is not supported on XPU, fallback to ray distributed executor backend.
INFO 03-05 15:09:06 [llm_engine.py:235] Initializing a V0 LLM engine (v0.1.dev4910+g3610fb4.d20250305) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
WARNING 03-05 15:09:06 [_logger.py:68] No existing RAY instance detected. A new instance will be launched with current node resources.
2025-03-05 15:09:07,130 WARNING _logger.py:68 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 16.5 to 16.
2025-03-05 15:09:07,134 WARNING _logger.py:68 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67018752 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-03-05 15:09:08,181 INFO worker.py:1841 -- Started a local Ray instance.
INFO 03-05 15:09:08 [ray_distributed_executor.py:171] use_ray_spmd_worker: False
(pid=96765) /projects/home/.local/lib/python3.12/site-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'libjpeg.so.8: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=96765)   warn(
(pid=96765) [W305 15:09:10.349097245 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
(pid=96765)   Overriding a previously registered kernel for the same operator and the same dispatch key
(pid=96765)   operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
(pid=96765)     registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
(pid=96765)   dispatch key: XPU
(pid=96765)   previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
(pid=96765)        new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
(pid=96765) INFO 03-05 15:09:11 [__init__.py:253] Automatically detected platform xpu.
INFO 03-05 15:09:11 [ray_distributed_executor.py:345] non_carry_over_env_vars from config: set()
INFO 03-05 15:09:11 [ray_distributed_executor.py:347] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_TRACE_FUNCTION', 'VLLM_WORKER_MULTIPROC_METHOD']
INFO 03-05 15:09:11 [ray_distributed_executor.py:350] If certain env vars should NOT be copied to workers, add them to /projects/home/.config/vllm/ray_non_carry_over_env_vars.json file
WARNING 03-05 15:09:11 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 03-05 15:09:11 [logger.py:206] Trace frame log is saved to /tmp/user/vllm/vllm-instance-ccf8f/VLLM_TRACE_FUNCTION_for_process_96388_thread_140616860819072_at_2025-03-05_15:09:11.828344.log
INFO 03-05 15:09:12 [xpu.py:35] Cannot use None backend on XPU.
INFO 03-05 15:09:12 [xpu.py:36] Using IPEX attention backend.
WARNING 03-05 15:09:12 [_logger.py:68] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 03-05 15:09:12 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 03-05 15:09:12 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
2025:03:05-15:09:12:(96388) |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2025:03:05-15:09:12:(96388) |CCL_WARN| value of CCL_LOCAL_RANK changed to be 0 (default:-1)
2025:03:05-15:09:12:(96388) |CCL_WARN| value of CCL_LOCAL_SIZE changed to be 1 (default:-1)
2025:03:05-15:09:12:(96388) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be none (default:hydra)
2025:03:05-15:09:12:(96388) |CCL_WARN| device_family is unknown, topology discovery could be incorrect, it might result in suboptimal performance
INFO 03-05 15:09:14 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 03-05 15:09:14 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.36it/s]

INFO 03-05 15:09:15 [loader.py:422] Loading weights took 0.79 seconds
WARNING 03-05 15:09:15 [_logger.py:68] Pin memory is not supported on XPU.
INFO 03-05 15:09:15 [xpu_model_runner.py:425] Loading model weights took 2.8875 GB

Note: The following is logged several times -

Unsupported gpu_arch of fmha_forward!!

Unsupported gpu_arch of fmha_forward!!

Then -

INFO 03-05 15:09:38 [executor_base.py:111] # xpu blocks: 54785, # CPU blocks: 9362
INFO 03-05 15:09:38 [executor_base.py:116] Maximum concurrency for 32768 tokens per request: 26.75x
INFO 03-05 15:09:41 [llm_engine.py:441] init engine (profile, create kv cache, warmup model) took 26.41 seconds
INFO 03-05 15:09:57 [api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 03-05 15:09:57 [launcher.py:26] Available routes are:
INFO 03-05 15:09:57 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 03-05 15:09:57 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 03-05 15:09:57 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-05 15:09:57 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 03-05 15:09:57 [launcher.py:34] Route: /health, Methods: GET
INFO 03-05 15:09:57 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 03-05 15:09:57 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 03-05 15:09:57 [launcher.py:34] Route: /version, Methods: GET
INFO 03-05 15:09:57 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /pooling, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /score, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /rerank, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 03-05 15:09:57 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [96344]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
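
Before hitting the completions endpoint, the server can be sanity-checked (a sketch assuming it is reachable locally on port 8080; /health and /v1/models are in the route list above):

curl http://localhost:8080/health
curl http://localhost:8080/v1/models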

Test -

curl -k https://cgruver-vllm-intel-gpu-vllm.apps.region-01.clg.lab/v1/completions -X POST -H "Content-Type: application/json" -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt": "San Francisco is a","max_tokens": 7,"temperature": 0}'
{"id":"cmpl-12ee6303c8df488aa5bc8bcf657093c8","object":"text_completion","created":1741187507,"model":"Qwen/Qwen2.5-1.5B-Instruct","choices":[{"index":0,"text":" pitch!!!!!!","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7,"prompt_tokens_details":null}}

STDOUT -

INFO 03-05 15:11:47 [logger.py:39] Received request cmpl-12ee6303c8df488aa5bc8bcf657093c8-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
INFO 03-05 15:11:47 [engine.py:289] Added request cmpl-12ee6303c8df488aa5bc8bcf657093c8-0.

Note: The following is logged several times -


Unsupported gpu_arch of fmha_forward!!

Unsupported gpu_arch of fmha_forward!!

Unsupported gpu_arch of fmha_forward!!

Unsupported gpu_arch of fmha_forward!!

Unsupported gpu_arch of paged_attention_v1!!

Unsupported gpu_arch of paged_attention_v1!!

Unsupported gpu_arch of paged_attention_v1!!

Unsupported gpu_arch of paged_attention_v1!!

Then -

INFO:     10.100.0.2:48078 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 03-05 15:11:59 [metrics.py:470] Avg prompt throughput: 0.3 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-05 15:12:09 [metrics.py:470] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
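
As a follow-up check (a hedged sketch, not part of the original report), the same prompt could be greedy-decoded with plain transformers on the XPU to see whether the garbage completion comes from the unsupported IPEX kernels that vLLM calls (fmha_forward, paged_attention_v1) or from the device/model stack itself:

python - << 'PYEOF'
# Hypothetical sanity check: run the same prompt without vLLM.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 - registers XPU support
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("xpu")

inputs = tok("San Francisco is a", return_tensors="pt").to("xpu")
out = model.generate(**inputs, max_new_tokens=7, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
PYEOF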

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
jikunshang (Contributor) commented:
Unfortunately, you are using Meteor Lake, and its GPU architecture is not supported by the paged attention kernel in IPEX, so vLLM cannot run on this platform.
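
For anyone wanting to confirm what GPU architecture the stack sees on their machine (a hedged sketch using tools already in the package list above; on Meteor Lake the iGPU reports as "Intel(R) Arc(TM) Graphics", as in the warning log):

clinfo | grep -i "device name"
python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.xpu.get_device_name(0))"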
