[Installation]: XPU dependencies are missing #11173

Open
pepijndevos opened this issue Dec 13, 2024 · 8 comments
Labels
installation Installation problems

Comments

@pepijndevos

Your current environment

[W1213 12:52:10.163702538 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
Collecting environment information...
PyTorch version: 2.5.1+cxx11.abi
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Arch Linux (x86_64)
GCC version: (GCC) 14.2.1 20240910
Clang version: 18.1.8
CMake version: version 3.31.1
Libc version: glibc-2.40

Python version: 3.10.16 | packaged by conda-forge | (main, Dec  5 2024, 14:16:10) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.12.4-arch1-1-x86_64-with-glibc2.40
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 9 7950X 16-Core Processor
CPU family:                           25
Model:                                97
Thread(s) per core:                   2
Core(s) per socket:                   16
Socket(s):                            1
Stepping:                             2
Frequency boost:                      enabled
CPU(s) scaling MHz:                   38%
CPU max MHz:                          4501,0000
CPU min MHz:                          400,0000
BogoMIPS:                             9004,83
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
Virtualization:                       AMD-V
L1d cache:                            512 KiB (16 instances)
L1i cache:                            512 KiB (16 instances)
L2 cache:                             16 MiB (16 instances)
L3 cache:                             64 MiB (2 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-31
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.5.10+xpu
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1+cxx11.abi
[pip3] transformers==4.47.0
[pip3] triton-xpu==3.0.0b1
[conda] intel-extension-for-pytorch 2.5.10+xpu               pypi_0    pypi
[conda] mkl                       2025.0.1                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.0.1                 pypi_0    pypi
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] onemkl-sycl-blas          2025.0.1                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.0.1                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.0.1                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.0.1                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.0.1                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.0.1                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.0.1                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.0.1                 pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.5.1+cxx11.abi          pypi_0    pypi
[conda] transformers              4.47.0                   pypi_0    pypi
[conda] triton-xpu                3.0.0b1                  pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post2.dev351+g969da7d7.d20241213
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

LD_LIBRARY_PATH=/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/cv2/../../lib64:/opt/intel/oneapi/tcm/1.2/lib:/opt/intel/oneapi/umf/0.9/lib:/opt/intel/oneapi/tbb/2022.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/pti/0.10/lib:/opt/intel/oneapi/mpi/2021.14/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.14/lib:/opt/intel/oneapi/mkl/2025.0/lib:/opt/intel/oneapi/ippcp/2025.0/lib/:/opt/intel/oneapi/ipp/2022.0/lib:/opt/intel/oneapi/dnnl/2025.0/lib:/opt/intel/oneapi/debugger/2025.0/opt/debugger/lib:/opt/intel/oneapi/dal/2025.0/lib:/opt/intel/oneapi/compiler/2025.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2025.0/lib:/opt/intel/oneapi/ccl/2021.14/lib/:/home/pepijn/mambaforge/envs/vllm/lib/libfabric:


How you are installing vllm

conda create -n "vllm" python=3.10
conda activate vllm
pip install -v -r requirements-xpu.txt
Using pip 24.3.1 from /home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/pip (python 3.10)
Ignoring fastapi: markers 'python_version < "3.9"' don't match your environment
Ignoring six: markers 'python_version > "3.11"' don't match your environment
Ignoring setuptools: markers 'python_version > "3.11"' don't match your environment
Collecting torch@ https://intel-extension-for-pytorch.s3.us-east-1.amazonaws.com/ipex_dev/xpu/torch-2.5.0a0%2Bgite84e33f-cp310-cp310-linux_x86_64.whl (from -r requirements-xpu.txt (line 12))
  ERROR: HTTP error 403 while getting https://intel-extension-for-pytorch.s3.us-east-1.amazonaws.com/ipex_dev/xpu/torch-2.5.0a0%2Bgite84e33f-cp310-cp310-linux_x86_64.whl
ERROR: Could not install requirement torch@ https://intel-extension-for-pytorch.s3.us-east-1.amazonaws.com/ipex_dev/xpu/torch-2.5.0a0%2Bgite84e33f-cp310-cp310-linux_x86_64.whl from https://intel-extension-for-pytorch.s3.us-east-1.amazonaws.com/ipex_dev/xpu/torch-2.5.0a0%2Bgite84e33f-cp310-cp310-linux_x86_64.whl (from -r requirements-xpu.txt (line 12)) because of HTTP error 403 Client Error: Forbidden for url: https://intel-extension-for-pytorch.s3.us-east-1.amazonaws.com/ipex_dev/xpu/torch-2.5.0a0%2Bgite84e33f-cp310-cp310-linux_x86_64.whl for URL https://intel-extension-for-pytorch.s3.us-east-1.amazonaws.com/ipex_dev/xpu/torch-2.5.0a0%2Bgite84e33f-cp310-cp310-linux_x86_64.whl

After removing the AWS URLs, this works:

pip install -v -r requirements-xpu.txt --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
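
For reference, a minimal sketch of the edit (my assumption: "removing the AWS URLs" just means commenting out the S3-pinned wheel lines in requirements-xpu.txt, so pip resolves torch, intel-extension-for-pytorch and oneccl_bind_pt from the Intel index instead):

# patch_requirements.py -- illustrative helper only, not part of the vLLM repo
from pathlib import Path

req = Path("requirements-xpu.txt")
patched = [
    f"# {line}" if "amazonaws.com" in line and not line.startswith("#") else line
    for line in req.read_text().splitlines()
]
req.write_text("\n".join(patched) + "\n")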

But there appears to be a version mismatch (a quick check of the installed varlen_attention signature is sketched after the log below):

vllm serve Qwen/Qwen2.5-1.5B-Instruct
[W1213 12:50:37.646175735 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
INFO 12-13 12:50:38 api_server.py:634] vLLM API server version 0.6.4.post2.dev351+g969da7d7.d20241213
INFO 12-13 12:50:38 api_server.py:635] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-1.5B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7bb9d887be20>)
INFO 12-13 12:50:38 api_server.py:198] Started engine process with PID 78789
[W1213 12:50:40.395680655 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
INFO 12-13 12:50:42 config.py:446] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
WARNING 12-13 12:50:42 _logger.py:68] bfloat16 is not fully supported on XPU, casting to float16.
WARNING 12-13 12:50:42 _logger.py:68] CUDA graph is not supported on XPU, fallback to the eager mode.
INFO 12-13 12:50:45 config.py:446] This model supports multiple tasks: {'generate', 'score', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 12-13 12:50:45 _logger.py:68] bfloat16 is not fully supported on XPU, casting to float16.
WARNING 12-13 12:50:45 _logger.py:68] CUDA graph is not supported on XPU, fallback to the eager mode.
INFO 12-13 12:50:45 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post2.dev351+g969da7d7.d20241213) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 12-13 12:50:45 xpu.py:26] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 12-13 12:50:45 selector.py:151] Using IPEX attention backend.
WARNING 12-13 12:50:45 _logger.py:68] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 12-13 12:50:45 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
2024:12:13-12:50:45:(78789) |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2024:12:13-12:50:45:(78789) |CCL_WARN| value of CCL_LOCAL_RANK changed to be 0 (default:-1)
2024:12:13-12:50:45:(78789) |CCL_WARN| value of CCL_LOCAL_SIZE changed to be 1 (default:-1)
2024:12:13-12:50:45:(78789) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be none (default:hydra)
2024:12:13-12:50:45:(78789) |CCL_WARN| device_family is unknown, topology discovery could be incorrect, it might result in suboptimal performance
INFO 12-13 12:50:46 weight_utils.py:243] Using model weights format ['*.safetensors']
INFO 12-13 12:50:46 weight_utils.py:288] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.02it/s]

WARNING 12-13 12:50:46 utils.py:727] Pin memory is not supported on XPU.
INFO 12-13 12:50:47 xpu_model_runner.py:415] Loading model weights took 2.8875 GB
ERROR 12-13 12:50:47 engine.py:366] varlen_fwd() takes 14 positional arguments but 15 were given
ERROR 12-13 12:50:47 engine.py:366] Traceback (most recent call last):
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 12-13 12:50:47 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 12-13 12:50:47 engine.py:366]     return cls(ipc_path=ipc_path,
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 12-13 12:50:47 engine.py:366]     self.engine = LLMEngine(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/llm_engine.py", line 291, in __init__
ERROR 12-13 12:50:47 engine.py:366]     self._initialize_kv_caches()
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/llm_engine.py", line 431, in _initialize_kv_caches
ERROR 12-13 12:50:47 engine.py:366]     self.model_executor.determine_num_available_blocks())
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/executor/gpu_executor.py", line 68, in determine_num_available_blocks
ERROR 12-13 12:50:47 engine.py:366]     return self.driver_worker.determine_num_available_blocks()
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-13 12:50:47 engine.py:366]     return func(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/worker/xpu_worker.py", line 104, in determine_num_available_blocks
ERROR 12-13 12:50:47 engine.py:366]     self.model_runner.profile_run()
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-13 12:50:47 engine.py:366]     return func(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/worker/xpu_model_runner.py", line 492, in profile_run
ERROR 12-13 12:50:47 engine.py:366]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-13 12:50:47 engine.py:366]     return func(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/worker/xpu_model_runner.py", line 566, in execute_model
ERROR 12-13 12:50:47 engine.py:366]     hidden_or_intermediate_states = model_executable(
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 12-13 12:50:47 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 12-13 12:50:47 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/model_executor/models/qwen2.py", line 477, in forward
ERROR 12-13 12:50:47 engine.py:366]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/compilation/decorators.py", line 168, in __call__
ERROR 12-13 12:50:47 engine.py:366]     return self.forward(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/model_executor/models/qwen2.py", line 340, in forward
ERROR 12-13 12:50:47 engine.py:366]     hidden_states, residual = layer(
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 12-13 12:50:47 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 12-13 12:50:47 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/model_executor/models/qwen2.py", line 247, in forward
ERROR 12-13 12:50:47 engine.py:366]     hidden_states = self.self_attn(
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 12-13 12:50:47 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 12-13 12:50:47 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/model_executor/models/qwen2.py", line 176, in forward
ERROR 12-13 12:50:47 engine.py:366]     attn_output = self.attn(q,
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 12-13 12:50:47 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 12-13 12:50:47 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/attention/layer.py", line 134, in forward
ERROR 12-13 12:50:47 engine.py:366]     return self.impl.forward(query,
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/attention/backends/ipex_attn.py", line 244, in forward
ERROR 12-13 12:50:47 engine.py:366]     ipex_ops.varlen_attention(
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/_ipex_ops.py", line 188, in varlen_attention
ERROR 12-13 12:50:47 engine.py:366]     ipex.llm.functional.varlen_attention(query.contiguous(),
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/functional/fusions.py", line 283, in varlen_attention
ERROR 12-13 12:50:47 engine.py:366]     return VarlenAttention.apply_function(
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/modules/mha_fusion.py", line 379, in apply_function
ERROR 12-13 12:50:47 engine.py:366]     ).apply_function(
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 237, in apply_function
ERROR 12-13 12:50:47 engine.py:366]     _IPEXVarlenScaledDotProductXPU.apply_function_flash_varlen(
ERROR 12-13 12:50:47 engine.py:366]   File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 311, in apply_function_flash_varlen
ERROR 12-13 12:50:47 engine.py:366]     torch.xpu.varlen_fwd(
ERROR 12-13 12:50:47 engine.py:366] TypeError: varlen_fwd() takes 14 positional arguments but 15 were given
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
    raise e
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
    return cls(ipc_path=ipc_path,
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 71, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/llm_engine.py", line 291, in __init__
    self._initialize_kv_caches()
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/engine/llm_engine.py", line 431, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/executor/gpu_executor.py", line 68, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/worker/xpu_worker.py", line 104, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/worker/xpu_model_runner.py", line 492, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/worker/xpu_model_runner.py", line 566, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/model_executor/models/qwen2.py", line 477, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/compilation/decorators.py", line 168, in __call__
    return self.forward(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/model_executor/models/qwen2.py", line 340, in forward
    hidden_states, residual = layer(
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/model_executor/models/qwen2.py", line 247, in forward
    hidden_states = self.self_attn(
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/model_executor/models/qwen2.py", line 176, in forward
    attn_output = self.attn(q,
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/attention/layer.py", line 134, in forward
    return self.impl.forward(query,
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/attention/backends/ipex_attn.py", line 244, in forward
    ipex_ops.varlen_attention(
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/_ipex_ops.py", line 188, in varlen_attention
    ipex.llm.functional.varlen_attention(query.contiguous(),
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/functional/fusions.py", line 283, in varlen_attention
    return VarlenAttention.apply_function(
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/modules/mha_fusion.py", line 379, in apply_function
    ).apply_function(
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 237, in apply_function
    _IPEXVarlenScaledDotProductXPU.apply_function_flash_varlen(
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 311, in apply_function_flash_varlen
    torch.xpu.varlen_fwd(
TypeError: varlen_fwd() takes 14 positional arguments but 15 were given
Traceback (most recent call last):
  File "/home/pepijn/mambaforge/envs/vllm/bin/vllm", line 33, in <module>
    sys.exit(load_entry_point('vllm==0.6.4.post2.dev351+g969da7d7.d20241213.xpu', 'console_scripts', 'vllm')())
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/scripts.py", line 201, in main
    args.dispatch_function(args)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/scripts.py", line 42, in serve
    uvloop.run(run_server(args))
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 658, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/pepijn/mambaforge/envs/vllm/lib/python3.10/site-packages/vllm-0.6.4.post2.dev351+g969da7d7.d20241213.xpu-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 222, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
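
The TypeError suggests the IPEX wheel that pip resolved exposes a varlen_fwd taking one argument fewer than what this vLLM revision passes. A quick way to confirm the mismatch on the installed packages (only a sketch; it uses the import path that appears in the traceback above plus the standard library):

import inspect
from importlib.metadata import version

import torch
# Import path taken from the traceback above
from intel_extension_for_pytorch.llm.functional import varlen_attention

print("torch:", torch.__version__)
print("intel_extension_for_pytorch:", version("intel_extension_for_pytorch"))
# Parameters the installed IPEX wrapper actually accepts
print(inspect.signature(varlen_attention))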

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
pepijndevos added the installation (Installation problems) label on Dec 13, 2024

QuentinVitt commented Dec 27, 2024

I have the exact same problem with almost the exact same installation process. Instead of removing the AWS URLs, I used this as the requirements-xpu.txt:

# Common dependencies
-r requirements-common.txt

ray >= 2.9
cmake>=3.26
ninja
packaging
setuptools-scm>=8
wheel
jinja2

#torch @ https://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/ipex_dev/xpu/torch-2.5.0a0%2Bgite84e33f-cp310-cp310-linux_x86_64.whl
#intel-extension-for-pytorch @ https://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/ipex_dev/xpu/intel_extension_for_pytorch-2.5.10%2Bgit9d489a8-cp310-cp310-linux_x86_64.whl
#oneccl_bind_pt @ https://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/ipex_dev/xpu/oneccl_bind_pt-2.5.0%2Bxpu-cp310-cp310-linux_x86_64.whl

triton-xpu == 3.0.0b1

and then used this:

python -m pip install torch==2.5.1+cxx11.abi torchvision==0.20.1+cxx11.abi torchaudio==2.5.1+cxx11.abi intel-extension-for-pytorch==2.5.10+xpu oneccl_bind_pt==2.5.0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
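
A quick post-install sanity check (just a sketch; it assumes this IPEX-enabled torch build exposes the torch.xpu API, as the tracebacks above indicate):

import torch
import intel_extension_for_pytorch as ipex  # registers the XPU backend and kernels

print("torch:", torch.__version__)
print("ipex:", ipex.__version__)
print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    print("device:", torch.xpu.get_device_name(0))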

As for the system I am using:

[W1227 23:08:36.920706038 OperatorEntry.cpp:155] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
Collecting environment information...
PyTorch version: 2.5.1+cxx11.abi
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.10 (x86_64)
GCC version: (Ubuntu 14.2.0-4ubuntu2) 14.2.0
Clang version: Could not collect
CMake version: version 3.31.2
Libc version: glibc-2.40

Python version: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.11.0-13-generic-x86_64-with-glibc2.40
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 4
CPU(s) scaling MHz: 23%
CPU max MHz: 7200.0000
CPU min MHz: 1200.0000
BogoMIPS: 6599.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault cat_l3 cdp_l3 pti ssbd mba ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_pkg_req vnmi md_clear flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 320 KiB (10 instances)
L1i cache: 320 KiB (10 instances)
L2 cache: 10 MiB (10 instances)
L3 cache: 16.5 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.5.10+xpu
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1+cxx11.abi
[pip3] torchaudio==2.5.1+cxx11.abi
[pip3] torchvision==0.20.1+cxx11.abi
[pip3] transformers==4.47.1
[pip3] triton-xpu==3.0.0b1
[conda] intel-extension-for-pytorch 2.5.10+xpu pypi_0 pypi
[conda] mkl 2025.0.1 pypi_0 pypi
[conda] mkl-dpcpp 2025.0.1 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] onemkl-sycl-blas 2025.0.1 pypi_0 pypi
[conda] onemkl-sycl-datafitting 2025.0.1 pypi_0 pypi
[conda] onemkl-sycl-dft 2025.0.1 pypi_0 pypi
[conda] onemkl-sycl-lapack 2025.0.1 pypi_0 pypi
[conda] onemkl-sycl-rng 2025.0.1 pypi_0 pypi
[conda] onemkl-sycl-sparse 2025.0.1 pypi_0 pypi
[conda] onemkl-sycl-stats 2025.0.1 pypi_0 pypi
[conda] onemkl-sycl-vm 2025.0.1 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.5.1+cxx11.abi pypi_0 pypi
[conda] torchaudio 2.5.1+cxx11.abi pypi_0 pypi
[conda] torchvision 0.20.1+cxx11.abi pypi_0 pypi
[conda] transformers 4.47.1 pypi_0 pypi
[conda] triton-xpu 3.0.0b1 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.6.post2.dev5+g5ce4627a.d20241227
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

LD_LIBRARY_PATH=/home/quentin/miniconda3/envs/vllm_env310/lib/python3.10/site-packages/cv2/../../lib64:/opt/intel/oneapi/tcm/1.2/lib:/opt/intel/oneapi/umf/0.9/lib:/opt/intel/oneapi/tbb/2022.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/pti/0.10/lib:/opt/intel/oneapi/mpi/2021.14/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.14/lib:/opt/intel/oneapi/mkl/2025.0/lib:/opt/intel/oneapi/ippcp/2025.0/lib/:/opt/intel/oneapi/ipp/2022.0/lib:/opt/intel/oneapi/dnnl/2025.0/lib:/opt/intel/oneapi/debugger/2025.0/opt/debugger/lib:/opt/intel/oneapi/dal/2025.0/lib:/opt/intel/oneapi/compiler/2025.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2025.0/lib:/opt/intel/oneapi/ccl/2021.14/lib/:/home/quentin/miniconda3/envs/vllm_env310/lib/libfabric:

@HiddenPeak

Some

sudo python3 -m vllm.entrypoints.openai.api_server
Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/home/a5770/rm05/develop/vllmrun/vllm/__init__.py", line 3, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "/home/a5770/rm05/develop/vllmrun/vllm/engine/arg_utils.py", line 8, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
(vllm) a5770@a5770-PA602-12900K:~/rm05/develop/vllmrun$ python3 -m vllm.entrypoints.openai.api_server
INFO 01-02 04:16:20 api_server.py:705] vLLM API server version 0.6.6.post2.dev47+ga115ac46
INFO 01-02 04:16:20 api_server.py:706] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='facebook/opt-125m', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
DEBUG 01-02 04:16:20 __init__.py:26] No plugins for group vllm.platform_plugins found.
[W102 04:16:21.808120355 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2960 (function operator())
INFO 01-02 04:16:22 __init__.py:179] Automatically detected platform xpu.
DEBUG 01-02 04:16:22 __init__.py:26] No plugins for group vllm.general_plugins found.
DEBUG 01-02 04:16:22 api_server.py:171] Multiprocessing frontend to use ipc:///tmp/f6342432-5d7d-4ae8-8cdf-eaa7f53f4c5c for IPC Path.
INFO 01-02 04:16:22 api_server.py:190] Started engine process with PID 4024903
DEBUG 01-02 04:16:23 __init__.py:26] No plugins for group vllm.platform_plugins found.
[W102 04:16:23.401228885 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2960 (function operator())
INFO 01-02 04:16:24 __init__.py:179] Automatically detected platform xpu.
DEBUG 01-02 04:16:24 __init__.py:26] No plugins for group vllm.general_plugins found.
INFO 01-02 04:16:29 config.py:517] This model supports multiple tasks: {'embed', 'generate', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 01-02 04:16:29 _logger.py:68] CUDA graph is not supported on XPU, fallback to the eager mode.
INFO 01-02 04:16:31 config.py:517] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
WARNING 01-02 04:16:31 _logger.py:68] CUDA graph is not supported on XPU, fallback to the eager mode.
INFO 01-02 04:16:31 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post2.dev47+ga115ac46) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 01-02 04:16:32 xpu.py:26] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 01-02 04:16:32 selector.py:151] Using IPEX attention backend.
WARNING 01-02 04:16:32 _logger.py:68] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 01-02 04:16:32 importing.py:14] Triton not installed or not compatible; certain GPU-related functions will not be available.
DEBUG 01-02 04:16:32 parallel_state.py:959] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.0.231:57029 backend=ccl
2025:01:02-04:16:32:(4024903) |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2025:01:02-04:16:32:(4024903) |CCL_WARN| value of CCL_LOCAL_RANK changed to be 0 (default:-1)
2025:01:02-04:16:32:(4024903) |CCL_WARN| value of CCL_LOCAL_SIZE changed to be 1 (default:-1)
2025:01:02-04:16:32:(4024903) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be none (default:hydra)
DEBUG 01-02 04:16:32 decorators.py:105] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.opt.OPTModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 01-02 04:16:32 config.py:3325] enabled custom ops: Counter()
DEBUG 01-02 04:16:32 config.py:3327] disabled custom ops: Counter()
INFO 01-02 04:16:33 weight_utils.py:251] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
/home/a5770/rm05/develop/vllmrun/vllm/model_executor/model_loader/weight_utils.py:450: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.18it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.18it/s]

WARNING 01-02 04:16:34 _logger.py:68] Pin memory is not supported on XPU.
INFO 01-02 04:16:34 xpu_model_runner.py:415] Loading model weights took 0.2389 GB
ERROR 01-02 04:16:34 engine.py:366] varlen_fwd() takes 14 positional arguments but 15 were given
ERROR 01-02 04:16:34 engine.py:366] Traceback (most recent call last):
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 01-02 04:16:34 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 01-02 04:16:34 engine.py:366]     return cls(ipc_path=ipc_path,
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 01-02 04:16:34 engine.py:366]     self.engine = LLMEngine(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/engine/llm_engine.py", line 276, in __init__
ERROR 01-02 04:16:34 engine.py:366]     self._initialize_kv_caches()
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/engine/llm_engine.py", line 416, in _initialize_kv_caches
ERROR 01-02 04:16:34 engine.py:366]     self.model_executor.determine_num_available_blocks())
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/executor/gpu_executor.py", line 68, in determine_num_available_blocks
ERROR 01-02 04:16:34 engine.py:366]     return self.driver_worker.determine_num_available_blocks()
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-02 04:16:34 engine.py:366]     return func(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/worker/xpu_worker.py", line 104, in determine_num_available_blocks
ERROR 01-02 04:16:34 engine.py:366]     self.model_runner.profile_run()
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-02 04:16:34 engine.py:366]     return func(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/worker/xpu_model_runner.py", line 492, in profile_run
ERROR 01-02 04:16:34 engine.py:366]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-02 04:16:34 engine.py:366]     return func(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/worker/xpu_model_runner.py", line 566, in execute_model
ERROR 01-02 04:16:34 engine.py:366]     hidden_or_intermediate_states = model_executable(
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-02 04:16:34 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-02 04:16:34 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 372, in forward
ERROR 01-02 04:16:34 engine.py:366]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/compilation/decorators.py", line 168, in __call__
ERROR 01-02 04:16:34 engine.py:366]     return self.forward(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 323, in forward
ERROR 01-02 04:16:34 engine.py:366]     return self.decoder(input_ids,
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-02 04:16:34 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-02 04:16:34 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 280, in forward
ERROR 01-02 04:16:34 engine.py:366]     hidden_states = layer(hidden_states,
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-02 04:16:34 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-02 04:16:34 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 173, in forward
ERROR 01-02 04:16:34 engine.py:366]     hidden_states = self.self_attn(hidden_states=hidden_states,
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-02 04:16:34 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-02 04:16:34 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 113, in forward
ERROR 01-02 04:16:34 engine.py:366]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-02 04:16:34 engine.py:366]     return self._call_impl(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-02 04:16:34 engine.py:366]     return forward_call(*args, **kwargs)
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/attention/layer.py", line 134, in forward
ERROR 01-02 04:16:34 engine.py:366]     return self.impl.forward(query,
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/attention/backends/ipex_attn.py", line 244, in forward
ERROR 01-02 04:16:34 engine.py:366]     ipex_ops.varlen_attention(
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/rm05/develop/vllmrun/vllm/_ipex_ops.py", line 188, in varlen_attention
ERROR 01-02 04:16:34 engine.py:366]     ipex.llm.functional.varlen_attention(query.contiguous(),
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/functional/fusions.py", line 283, in varlen_attention
ERROR 01-02 04:16:34 engine.py:366]     return VarlenAttention.apply_function(
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/modules/mha_fusion.py", line 379, in apply_function
ERROR 01-02 04:16:34 engine.py:366]     ).apply_function(
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 237, in apply_function
ERROR 01-02 04:16:34 engine.py:366]     _IPEXVarlenScaledDotProductXPU.apply_function_flash_varlen(
ERROR 01-02 04:16:34 engine.py:366]   File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 311, in apply_function_flash_varlen
ERROR 01-02 04:16:34 engine.py:366]     torch.xpu.varlen_fwd(
ERROR 01-02 04:16:34 engine.py:366] TypeError: varlen_fwd() takes 14 positional arguments but 15 were given
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
    raise e
  File "/home/a5770/rm05/develop/vllmrun/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/home/a5770/rm05/develop/vllmrun/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
    return cls(ipc_path=ipc_path,
  File "/home/a5770/rm05/develop/vllmrun/vllm/engine/multiprocessing/engine.py", line 71, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/engine/llm_engine.py", line 276, in __init__
    self._initialize_kv_caches()
  File "/home/a5770/rm05/develop/vllmrun/vllm/engine/llm_engine.py", line 416, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/home/a5770/rm05/develop/vllmrun/vllm/executor/gpu_executor.py", line 68, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/worker/xpu_worker.py", line 104, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/worker/xpu_model_runner.py", line 492, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/worker/xpu_model_runner.py", line 566, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 372, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/a5770/rm05/develop/vllmrun/vllm/compilation/decorators.py", line 168, in __call__
    return self.forward(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 323, in forward
    return self.decoder(input_ids,
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 280, in forward
    hidden_states = layer(hidden_states,
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 173, in forward
    hidden_states = self.self_attn(hidden_states=hidden_states,
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/model_executor/models/opt.py", line 113, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/a5770/rm05/develop/vllmrun/vllm/attention/layer.py", line 134, in forward
    return self.impl.forward(query,
  File "/home/a5770/rm05/develop/vllmrun/vllm/attention/backends/ipex_attn.py", line 244, in forward
    ipex_ops.varlen_attention(
  File "/home/a5770/rm05/develop/vllmrun/vllm/_ipex_ops.py", line 188, in varlen_attention
    ipex.llm.functional.varlen_attention(query.contiguous(),
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/functional/fusions.py", line 283, in varlen_attention
    return VarlenAttention.apply_function(
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/modules/mha_fusion.py", line 379, in apply_function
    ).apply_function(
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 237, in apply_function
    _IPEXVarlenScaledDotProductXPU.apply_function_flash_varlen(
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 311, in apply_function_flash_varlen
    torch.xpu.varlen_fwd(
TypeError: varlen_fwd() takes 14 positional arguments but 15 were given
DEBUG 01-02 04:16:40 client.py:252] Shutting down MQLLMEngineClient output handler.
Traceback (most recent call last):
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/a5770/rm05/develop/vllmrun/vllm/entrypoints/openai/api_server.py", line 767, in <module>
    uvloop.run(run_server(args))
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/a5770/rm05/develop/vllmrun/vllm/entrypoints/openai/api_server.py", line 733, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/a5770/rm05/develop/vllmrun/vllm/entrypoints/openai/api_server.py", line 120, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/a5770/miniforge3/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/a5770/rm05/develop/vllmrun/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
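
For what it's worth, the TypeError at the end (varlen_fwd() takes 14 positional arguments but 15 were given) suggests a signature mismatch: vLLM's IPEX attention path calls varlen_attention with one more argument than the installed intel-extension-for-pytorch build accepts, i.e. the two were built against different IPEX releases. A minimal pre-flight sketch, assuming the 2.5.10+xpu pin mentioned later in this thread is the release this vLLM checkout expects:

# Pre-flight version check (sketch). Assumption: this vLLM checkout expects
# intel-extension-for-pytorch 2.5.10+xpu, the version pinned later in this thread.
# It only compares installed versions; it does not patch varlen_fwd() itself.
from importlib.metadata import version

import torch

ipex_version = version("intel-extension-for-pytorch")
print(f"torch: {torch.__version__}")
print(f"intel-extension-for-pytorch: {ipex_version}")

if not ipex_version.startswith("2.5.10"):
    raise SystemExit(
        "IPEX release mismatch: varlen_fwd()'s signature differs across releases, "
        "which shows up as exactly this TypeError during vLLM's profile_run()."
    )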

@Ankur-singh

I'm facing the exact same issue. Any updates?

@jikunshang
Contributor

Are you using an Intel Arc graphics card? There is a bug in the dispatch code on Arc cards. A quick workaround is:

sed -i '326d' CONDA_ENV_PATH/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py

Sorry for the late response.
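
If you're not sure which file that one-liner should point at, the installed package can report it (a small sketch; assumes intel-extension-for-pytorch imports cleanly in the active environment):

# Print the absolute path of the file that the sed workaround above edits.
import intel_extension_for_pytorch.transformers.models.xpu.fusions.mha_fusion as mha_fusion

print(mha_fusion.__file__)  # pass this path to: sed -i '326d' <path>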

@udif

udif commented Jan 22, 2025

I'm also having issues with an Arc A770, but I think I've solved the requirements-xpu.txt issue:

# Common dependencies
-r requirements-common.txt

ray >= 2.9
cmake>=3.26
ninja
packaging
setuptools-scm>=8
wheel
jinja2

torch==2.5.1+cxx11.abi
torchvision==0.20.1+cxx11.abi
torchaudio==2.5.1+cxx11.abi
intel-extension-for-pytorch==2.5.10+xpu
oneccl_bind_pt==2.5.0+xpu
--extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

triton-xpu == 3.2.0b1

If you want the older oneAPI 2024.2 installer, you can still find its download page via the Wayback Machine.
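
After installing with those pins, a quick sanity check that the XPU backend is visible to PyTorch (a minimal sketch; the expected version strings are just the pins above, not an official matrix):

# Verify the XPU stack after installing the pinned wheels above.
import torch
import intel_extension_for_pytorch as ipex  # importing IPEX registers its XPU kernels

print(torch.__version__)          # e.g. 2.5.1+cxx11.abi
print(ipex.__version__)           # e.g. 2.5.10+xpu
print(torch.xpu.is_available())   # should be True on a working Arc setup
print(torch.xpu.device_count())   # number of visible XPU devices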

@cgruver

cgruver commented Mar 5, 2025

Related issue?

#14295

@cgruver

cgruver commented Mar 5, 2025

Also note: "Intel® XPU Backend for Triton* is not compatible with Intel® Extension for PyTorch* and Intel® oneAPI Base Toolkit*."

From - https://github.com/intel/intel-xpu-backend-for-triton

Looks like Triton and intel-extension-for-pytorch are mutually exclusive.

@jikunshang
Contributor

Also note: "Intel® XPU Backend for Triton* is not compatible with Intel® Extension for PyTorch* and Intel® oneAPI Base Toolkit*."

From - https://github.com/intel/intel-xpu-backend-for-triton

Looks like Triton and intel-extension-for-pytorch are mutually exclusive.

  1. We don't actually use any execution component from intel-xpu-backend-for-triton yet, so there is no compatibility issue.
  2. For future versions, we can leverage the vLLM compile framework to bypass potential compatibility issues. This part is still WIP.
