[Bug][V1]: Loading Llama3.1-8B-INT8 gets OOM when using VLLM_USE_V1=1 but works with V0 #14286

Open · 1 task done
fahadh4ilyas opened this issue Mar 5, 2025 · 0 comments
Labels: bug (Something isn't working)

fahadh4ilyas (Contributor) commented Mar 5, 2025

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 535.183.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        39 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               16
On-line CPU(s) list:                  0-15
Vendor ID:                            GenuineIntel
Model name:                           13th Gen Intel(R) Core(TM) i5-13400
CPU family:                           6
Model:                                191
Thread(s) per core:                   2
Core(s) per socket:                   10
Socket(s):                            1
Stepping:                             2
CPU max MHz:                          4600,0000
CPU min MHz:                          800,0000
BogoMIPS:                             4992.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            416 KiB (10 instances)
L1i cache:                            448 KiB (10 instances)
L2 cache:                             9,5 MiB (7 instances)
L3 cache:                             20 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-15
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Mitigation; Clear Register File
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0
[pip3] triton==3.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pyzmq                     26.2.1                   pypi_0    pypi
[conda] torch                     2.5.1                    pypi_0    pypi
[conda] torchaudio                2.5.1                    pypi_0    pypi
[conda] torchvision               0.20.1                   pypi_0    pypi
[conda] transformers              4.49.0                   pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.4.dev30+gda31b533
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

LD_LIBRARY_PATH=/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/cv2/../../lib64:/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/nvidia/nvjitlink/lib:/usr/local/cuda-12.2/lib64
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

I have a Llama3.1 model with this config:

`config.json` of Llama3.1-8B-INT8 model
{
  "_name_or_path": "/network/alexandre/models/meta_llama__Meta-Llama-3.1-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "compression_config": {
    "config_groups": {
      "group_0": {
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "block_structure": null,
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "channel",
          "symmetric": true,
          "type": "int"
        }
      }
    },
    "format": "pack-quantized",
    "global_compression_ratio": 1.458959021545191,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "frozen",
    "sparsity_config": {
      "format": "dense",
      "global_sparsity": 1.2473729422557387,
      "registry_requires_subclass": false,
      "sparsity_structure": "unstructured"
    }
  },
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.1",
  "use_cache": true,
  "vocab_size": 128256
}

I load it with vLLM like this:

VLLM_USE_V1=1 vllm serve models/llama3.1-8B-INT8 --host 0.0.0.0 --served-model-name llama3.1-8B llama3.1-8B-Int8 --port 8000 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser llama3_json
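
For reference, a roughly equivalent invocation through the offline Python API (just a sketch; the serving-only flags such as the tool-call parser are omitted since they do not affect engine initialization):

# Hypothetical offline-API equivalent of the serve command above.
import os
os.environ["VLLM_USE_V1"] = "1"   # set before importing vllm, to be safe

from vllm import LLM

llm = LLM(
    model="models/llama3.1-8B-INT8",
    max_model_len=65536,
    gpu_memory_utilization=0.9,   # the default, spelled out for clarity
)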

But with VLLM_USE_V1=1, I get this error:

Log error loading Llama3.1 using VLLM V1
INFO 03-05 19:37:21 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 19:37:22 api_server.py:911] vLLM API server version 0.7.4.dev30+gda31b533
INFO 03-05 19:37:22 api_server.py:912] args: Namespace(subparser='serve', model_tag='models/llama3.1-8B-INT8', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, enable_reasoning=False, reasoning_parser=None, tool_call_parser='llama3_json', tool_parser_plugin='', model='models/llama3.1-8B-INT8', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=65536, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama3.1-8B', 'llama3.1-8B-Int8'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, 
calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7934b0274540>)
WARNING 03-05 19:37:22 arg_utils.py:1387] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
INFO 03-05 19:37:25 config.py:553] This model supports multiple tasks: {'embed', 'classify', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 03-05 19:37:25 config.py:1561] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-05 19:37:27 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 19:37:28 core.py:50] Initializing a V1 LLM engine (v0.7.4.dev30+gda31b533) with config: model='models/llama3.1-8B-INT8', speculative_config=None, tokenizer='models/llama3.1-8B-INT8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama3.1-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-05 19:37:28 utils.py:2277] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7211f3cb5050>
INFO 03-05 19:37:29 gpu_model_runner.py:1050] Starting to load model models/llama3.1-8B-INT8...
INFO 03-05 19:37:29 compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 03-05 19:37:29 cuda.py:157] Using Flash Attention backend on V1 engine.
WARNING 03-05 19:37:29 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 03-05 19:37:29 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:05<00:05,  5.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.58s/it]

INFO 03-05 19:37:40 gpu_model_runner.py:1062] Loading model weights took 8.4927 GB and 11.407232 seconds
INFO 03-05 19:37:45 backends.py:408] Using cache directory: /home/fahadh/.cache/vllm/torch_compile_cache/47e23b4825/rank_0 for vLLM's torch.compile
INFO 03-05 19:37:45 backends.py:418] Dynamo bytecode transform time: 5.12 s
INFO 03-05 19:37:47 backends.py:132] Cache the graph of shape None for later use
INFO 03-05 19:38:02 backends.py:144] Compiling a graph for general shape takes 15.78 s
INFO 03-05 19:38:04 monitor.py:33] torch.compile takes 20.90 s in total
ERROR 03-05 19:38:04 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 283, in run_engine_core
ERROR 03-05 19:38:04 core.py:291]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 03-05 19:38:04 core.py:291]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 238, in __init__
ERROR 03-05 19:38:04 core.py:291]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 59, in __init__
ERROR 03-05 19:38:04 core.py:291]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 03-05 19:38:04 core.py:291]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 99, in _initialize_kv_caches
ERROR 03-05 19:38:04 core.py:291]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 03-05 19:38:04 core.py:291]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 61, in determine_available_memory
ERROR 03-05 19:38:04 core.py:291]     output = self.collective_rpc("determine_available_memory")
ERROR 03-05 19:38:04 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 03-05 19:38:04 core.py:291]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 03-05 19:38:04 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/utils.py", line 2211, in run_method
ERROR 03-05 19:38:04 core.py:291]     return func(*args, **kwargs)
ERROR 03-05 19:38:04 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-05 19:38:04 core.py:291]     return func(*args, **kwargs)
ERROR 03-05 19:38:04 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 154, in determine_available_memory
ERROR 03-05 19:38:04 core.py:291]     self.model_runner.profile_run()
ERROR 03-05 19:38:04 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1311, in profile_run
ERROR 03-05 19:38:04 core.py:291]     dummy_metadata = SamplingMetadata(
ERROR 03-05 19:38:04 core.py:291]                      ^^^^^^^^^^^^^^^^^
ERROR 03-05 19:38:04 core.py:291] TypeError: SamplingMetadata.__init__() missing 1 required positional argument: 'allowed_token_ids_mask'
ERROR 03-05 19:38:04 core.py:291]
CRITICAL 03-05 19:38:04 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
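
The immediate failure here is not a memory problem: per the traceback, the dummy SamplingMetadata built in profile_run is missing the newly required allowed_token_ids_mask argument. A minimal, hypothetical sketch of the mechanism (not vLLM's actual code), just to show why the profiling call site blows up:

from dataclasses import dataclass

@dataclass
class SamplingMetadata:              # heavily simplified, hypothetical stand-in
    temperature: float
    top_p: float
    allowed_token_ids_mask: object   # newly added required field, no default

# A call site written before the new field existed:
SamplingMetadata(temperature=1.0, top_p=1.0)
# -> TypeError: SamplingMetadata.__init__() missing 1 required positional
#    argument: 'allowed_token_ids_mask'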

If I load it like this instead (without VLLM_USE_V1=1)

vllm serve models/llama3.1-8B-INT8 --host 0.0.0.0 --served-model-name llama3.1-8B llama3.1-8B-Int8 --port 8000 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser llama3_json

it works perfectly fine:

Log loading Llama3.1 using VLLM V0
INFO 03-05 19:38:44 api_server.py:911] vLLM API server version 0.7.4.dev30+gda31b533
INFO 03-05 19:38:44 api_server.py:912] args: Namespace(subparser='serve', model_tag='models/llama3.1-8B-INT8', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, enable_reasoning=False, reasoning_parser=None, tool_call_parser='llama3_json', tool_parser_plugin='', model='models/llama3.1-8B-INT8', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=65536, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama3.1-8B', 'llama3.1-8B-Int8'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, 
calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x711137ec0ea0>)
INFO 03-05 19:38:44 api_server.py:208] Started engine process with PID 342161
INFO 03-05 19:38:46 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 19:38:47 config.py:553] This model supports multiple tasks: {'embed', 'score', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
WARNING 03-05 19:38:47 arg_utils.py:1190] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 03-05 19:38:47 config.py:1561] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-05 19:38:49 config.py:553] This model supports multiple tasks: {'score', 'generate', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 03-05 19:38:50 arg_utils.py:1190] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 03-05 19:38:50 config.py:1561] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-05 19:38:50 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.4.dev30+gda31b533) with config: model='models/llama3.1-8B-INT8', speculative_config=None, tokenizer='models/llama3.1-8B-INT8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama3.1-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 03-05 19:38:50 cuda.py:229] Using Flash Attention backend.
INFO 03-05 19:38:50 model_runner.py:1110] Starting to load model models/llama3.1-8B-INT8...
INFO 03-05 19:38:50 compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:05<00:05,  5.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.58s/it]

INFO 03-05 19:39:02 model_runner.py:1117] Loading model weights took 8.4927 GB and 11.396740 seconds
INFO 03-05 19:39:03 worker.py:267] Memory profiling takes 0.68 seconds
INFO 03-05 19:39:03 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB
INFO 03-05 19:39:03 worker.py:267] model weights take 8.49GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 11.52GiB.
INFO 03-05 19:39:03 executor_base.py:111] # cuda blocks: 5899, # CPU blocks: 2048
INFO 03-05 19:39:03 executor_base.py:116] Maximum concurrency for 65536 tokens per request: 1.44x
INFO 03-05 19:39:04 model_runner.py:1437] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:08<00:00,  3.94it/s]
INFO 03-05 19:39:13 model_runner.py:1565] Graph capturing finished in 9 secs, took 0.98 GiB
INFO 03-05 19:39:13 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 10.78 seconds
INFO 03-05 19:39:13 serving_chat.py:76] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
INFO 03-05 19:39:13 api_server.py:957] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-05 19:39:13 launcher.py:23] Available routes are:
INFO 03-05 19:39:13 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 03-05 19:39:13 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 03-05 19:39:13 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-05 19:39:13 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 03-05 19:39:13 launcher.py:31] Route: /health, Methods: GET
INFO 03-05 19:39:13 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 03-05 19:39:13 launcher.py:31] Route: /tokenize, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /detokenize, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /v1/models, Methods: GET
INFO 03-05 19:39:13 launcher.py:31] Route: /version, Methods: GET
INFO 03-05 19:39:13 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /pooling, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /score, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /v1/score, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /rerank, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 03-05 19:39:13 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [342138]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
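
For what it's worth, the V0 memory accounting above is self-consistent. A rough re-check of the logged numbers, assuming V0's default 16-token block size and a bf16 KV cache sized from config.json (32 layers × 8 KV heads × 128 head dim × 2 for K+V × 2 bytes = 128 KiB per token):

# Rough sanity check of the V0 profiling log (figures taken from the log above).
GiB = 1024**3
usable = 23.64 * 0.90 * GiB                       # ~21.28 GiB
kv_budget = usable - (8.49 + 0.08 + 1.19) * GiB   # ~11.5 GiB left for the KV cache

kv_per_token = 32 * 8 * 128 * 2 * 2               # 131072 bytes = 128 KiB
tokens = kv_budget / kv_per_token                 # ~94k cacheable tokens
print(round(tokens / 16))                         # ~5.9k blocks, close to the logged 5899
print(round(tokens / 65536, 2))                   # ~1.44x concurrency, matching the log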

What I know is that the culprit is commit da31b53, because if I try the commit right before it, bb78fb3, it works perfectly:

Log loading Llama3.1 using VLLM V1 in commit bb78fb3
INFO 03-05 19:35:26 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 19:35:27 api_server.py:911] vLLM API server version 0.7.4.dev29+gbb78fb31
INFO 03-05 19:35:27 api_server.py:912] args: Namespace(subparser='serve', model_tag='models/llama3.1-8B-INT8', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, enable_reasoning=False, reasoning_parser=None, tool_call_parser='llama3_json', tool_parser_plugin='', model='models/llama3.1-8B-INT8', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=65536, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama3.1-8B', 'llama3.1-8B-Int8'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, 
calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7c1f9ac70540>)
WARNING 03-05 19:35:27 arg_utils.py:1387] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
INFO 03-05 19:35:30 config.py:553] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 03-05 19:35:30 config.py:1561] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-05 19:35:33 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 19:35:33 core.py:50] Initializing a V1 LLM engine (v0.7.4.dev29+gbb78fb31) with config: model='models/llama3.1-8B-INT8', speculative_config=None, tokenizer='models/llama3.1-8B-INT8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama3.1-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-05 19:35:34 utils.py:2277] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x789bb8ca9910>
INFO 03-05 19:35:34 gpu_model_runner.py:1049] Starting to load model models/llama3.1-8B-INT8...
INFO 03-05 19:35:34 compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 03-05 19:35:34 cuda.py:157] Using Flash Attention backend on V1 engine.
WARNING 03-05 19:35:34 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 03-05 19:35:34 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:05<00:05,  5.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.68s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.58s/it]

INFO 03-05 19:35:45 gpu_model_runner.py:1061] Loading model weights took 8.4927 GB and 11.400595 seconds
INFO 03-05 19:35:51 backends.py:408] Using cache directory: /home/fahadh/.cache/vllm/torch_compile_cache/1e46ef2c33/rank_0 for vLLM's torch.compile
INFO 03-05 19:35:51 backends.py:418] Dynamo bytecode transform time: 5.18 s
INFO 03-05 19:35:52 backends.py:132] Cache the graph of shape None for later use
INFO 03-05 19:36:07 backends.py:144] Compiling a graph for general shape takes 15.82 s
INFO 03-05 19:36:09 monitor.py:33] torch.compile takes 21.00 s in total
INFO 03-05 19:36:10 kv_cache_utils.py:524] GPU KV cache size: 95,536 tokens
INFO 03-05 19:36:10 kv_cache_utils.py:527] Maximum concurrency for 65,536 tokens per request: 1.46x
INFO 03-05 19:36:27 gpu_model_runner.py:1341] Graph capturing finished in 17 secs, took 2.02 GiB
INFO 03-05 19:36:27 core.py:116] init engine (profile, create kv cache, warmup model) took 41.34 seconds
INFO 03-05 19:36:27 serving_chat.py:76] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
INFO 03-05 19:36:27 api_server.py:957] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-05 19:36:27 launcher.py:23] Available routes are:
INFO 03-05 19:36:27 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 03-05 19:36:27 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 03-05 19:36:27 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-05 19:36:27 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 03-05 19:36:27 launcher.py:31] Route: /health, Methods: GET
INFO 03-05 19:36:27 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 03-05 19:36:27 launcher.py:31] Route: /tokenize, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /detokenize, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /v1/models, Methods: GET
INFO 03-05 19:36:27 launcher.py:31] Route: /version, Methods: GET
INFO 03-05 19:36:27 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /pooling, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /score, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /v1/score, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /rerank, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 03-05 19:36:27 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [341093]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Any explanation for why this happens?

cc: @JenZhao and @ywang96

EDIT:

I didn't realize that the error from da31b53 is different from the error I get on the latest commit on main. After checking further, I found that between commits da31b53 and 7f6bae5 the error is about SamplingMetadata, but from commit 7f6bae5 onward the error is an OOM. Here is the actual OOM error:

Log OOM error loading Llama3.1 using VLLM V1 in commit 7f6bae5
INFO 03-05 20:16:23 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 20:16:24 api_server.py:911] vLLM API server version 0.7.4.dev35+g7f6bae56
INFO 03-05 20:16:24 api_server.py:912] args: Namespace(subparser='serve', model_tag='models/llama3.1-8B-INT8', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, enable_reasoning=False, reasoning_parser=None, tool_call_parser='llama3_json', tool_parser_plugin='', model='models/llama3.1-8B-INT8', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=65536, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama3.1-8B', 'llama3.1-8B-Int8'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, 
enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x726198670720>)
WARNING 03-05 20:16:24 arg_utils.py:1407] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
INFO 03-05 20:16:27 config.py:559] This model supports multiple tasks: {'score', 'embed', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 03-05 20:16:27 config.py:1567] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-05 20:16:29 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 20:16:30 core.py:50] Initializing a V1 LLM engine (v0.7.4.dev35+g7f6bae56) with config: model='models/llama3.1-8B-INT8', speculative_config=None, tokenizer='models/llama3.1-8B-INT8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama3.1-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-05 20:16:30 utils.py:2279] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x77a461bbcc90>
INFO 03-05 20:16:31 gpu_model_runner.py:1050] Starting to load model models/llama3.1-8B-INT8...
INFO 03-05 20:16:31 compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 03-05 20:16:31 cuda.py:157] Using Flash Attention backend on V1 engine.
WARNING 03-05 20:16:31 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 03-05 20:16:31 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:05<00:05,  5.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.58s/it]

INFO 03-05 20:16:42 gpu_model_runner.py:1062] Loading model weights took 8.4927 GB and 11.406877 seconds
INFO 03-05 20:16:47 backends.py:408] Using cache directory: /home/fahadh/.cache/vllm/torch_compile_cache/82c902e0ff/rank_0 for vLLM's torch.compile
INFO 03-05 20:16:47 backends.py:418] Dynamo bytecode transform time: 5.12 s
INFO 03-05 20:16:49 backends.py:132] Cache the graph of shape None for later use
INFO 03-05 20:17:04 backends.py:144] Compiling a graph for general shape takes 15.86 s
INFO 03-05 20:17:06 monitor.py:33] torch.compile takes 20.98 s in total
ERROR 03-05 20:17:07 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-05 20:17:07 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 283, in run_engine_core
ERROR 03-05 20:17:07 core.py:291]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 03-05 20:17:07 core.py:291]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 20:17:07 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 238, in __init__
ERROR 03-05 20:17:07 core.py:291]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 03-05 20:17:07 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 59, in __init__
ERROR 03-05 20:17:07 core.py:291]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 03-05 20:17:07 core.py:291]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 20:17:07 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 102, in _initialize_kv_caches
ERROR 03-05 20:17:07 core.py:291]     kv_cache_configs = get_kv_cache_configs(vllm_config, kv_cache_specs,
ERROR 03-05 20:17:07 core.py:291]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-05 20:17:07 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 563, in get_kv_cache_configs
ERROR 03-05 20:17:07 core.py:291]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec,
ERROR 03-05 20:17:07 core.py:291]   File "/home/fahadh/anaconda3/envs/test-vllm/lib/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 465, in check_enough_kv_cache_memory
ERROR 03-05 20:17:07 core.py:291]     raise ValueError(
ERROR 03-05 20:17:07 core.py:291] ValueError: To serve at least one request with the models's max seq len (65536), (8.00 GB KV cache is needed, which is larger than the available KV cache memory (6.48 GB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
ERROR 03-05 20:17:07 core.py:291]
CRITICAL 03-05 20:17:07 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
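
The 8.00 GB in this error is exactly the bf16 KV-cache footprint of a single 65,536-token request for this config (same 128 KiB-per-token figure as above), so the difference from V0 is in how much free memory the V1 profiling run leaves behind. A quick check:

# Quick check of the figures in the V1 error above (assumes a bf16 KV cache).
kv_per_token = 32 * 8 * 128 * 2 * 2            # 131072 bytes = 128 KiB per token
print(65536 * kv_per_token / 1024**3)          # -> 8.0 GiB needed for one max-length request
# V1 reports only 6.48 GB available after profiling, hence the ValueError,
# while V0 with the same settings reserved ~11.5 GiB for the KV cache.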

And it works perfectly fine with vLLM V0:

Log loading Llama3.1 using VLLM V0 with commit 7f6bae5
INFO 03-05 20:22:38 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 20:22:38 api_server.py:911] vLLM API server version 0.7.4.dev35+g7f6bae56
INFO 03-05 20:22:38 api_server.py:912] args: Namespace(subparser='serve', model_tag='models/llama3.1-8B-INT8', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, enable_reasoning=False, reasoning_parser=None, tool_call_parser='llama3_json', tool_parser_plugin='', model='models/llama3.1-8B-INT8', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=65536, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama3.1-8B', 'llama3.1-8B-Int8'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, 
enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x78af7bc9cfe0>)
INFO 03-05 20:22:38 api_server.py:208] Started engine process with PID 352366
INFO 03-05 20:22:41 __init__.py:207] Automatically detected platform cuda.
INFO 03-05 20:22:41 config.py:559] This model supports multiple tasks: {'classify', 'reward', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
WARNING 03-05 20:22:42 arg_utils.py:1204] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 03-05 20:22:42 config.py:1567] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-05 20:22:44 config.py:559] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 03-05 20:22:44 arg_utils.py:1204] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 03-05 20:22:44 config.py:1567] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-05 20:22:44 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.4.dev35+g7f6bae56) with config: model='models/llama3.1-8B-INT8', speculative_config=None, tokenizer='models/llama3.1-8B-INT8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama3.1-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 03-05 20:22:44 cuda.py:229] Using Flash Attention backend.
INFO 03-05 20:22:45 model_runner.py:1110] Starting to load model models/llama3.1-8B-INT8...
INFO 03-05 20:22:45 compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:05<00:05,  5.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:11<00:00,  5.58s/it]

INFO 03-05 20:22:56 model_runner.py:1117] Loading model weights took 8.4927 GB and 11.399188 seconds
INFO 03-05 20:22:57 worker.py:267] Memory profiling takes 0.68 seconds
INFO 03-05 20:22:57 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB
INFO 03-05 20:22:57 worker.py:267] model weights take 8.49GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 11.52GiB.
INFO 03-05 20:22:57 executor_base.py:111] # cuda blocks: 5899, # CPU blocks: 2048
INFO 03-05 20:22:57 executor_base.py:116] Maximum concurrency for 65536 tokens per request: 1.44x
INFO 03-05 20:22:58 model_runner.py:1437] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:08<00:00,  3.93it/s]
INFO 03-05 20:23:07 model_runner.py:1565] Graph capturing finished in 9 secs, took 0.98 GiB
INFO 03-05 20:23:07 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 10.82 seconds
INFO 03-05 20:23:07 serving_chat.py:76] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
INFO 03-05 20:23:07 api_server.py:957] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-05 20:23:07 launcher.py:23] Available routes are:
INFO 03-05 20:23:07 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 03-05 20:23:07 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 03-05 20:23:07 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-05 20:23:07 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 03-05 20:23:07 launcher.py:31] Route: /health, Methods: GET
INFO 03-05 20:23:07 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 03-05 20:23:07 launcher.py:31] Route: /tokenize, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /detokenize, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /v1/models, Methods: GET
INFO 03-05 20:23:07 launcher.py:31] Route: /version, Methods: GET
INFO 03-05 20:23:07 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /pooling, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /score, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /v1/score, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /rerank, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 03-05 20:23:07 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [352343]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

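As a sanity check on the V0 numbers above, the KV-cache budget, block count, and concurrency figure can be re-derived with a few lines of arithmetic. This is only a rough sketch: the block size (16 tokens) and the Llama-3.1-8B KV shape (32 layers, 8 KV heads, head_dim 128, bf16 KV cache) are assumptions, since none of them are printed in the log.

```python
# Rough sanity check of the V0 memory-profiling lines above.
# Assumptions (not printed in the log): block_size = 16 tokens,
# Llama-3.1-8B KV shape = 32 layers x 8 KV heads x 128 head_dim,
# KV cache stored in bfloat16 (2 bytes) since kv_cache_dtype=auto.

GiB = 1024 ** 3

total_gpu_memory = 23.64 * GiB
gpu_memory_utilization = 0.90
weights = 8.49 * GiB
non_torch = 0.08 * GiB
activation_peak = 1.19 * GiB

kv_cache_budget = (total_gpu_memory * gpu_memory_utilization
                   - weights - non_torch - activation_peak)
print(f"KV cache budget: {kv_cache_budget / GiB:.2f} GiB")  # ~11.52 GiB, matches the log

# Per-block KV size: 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes * block_size
block_bytes = 2 * 32 * 8 * 128 * 2 * 16  # 2 MiB per block
num_gpu_blocks = int(kv_cache_budget // block_bytes)
print(f"# cuda blocks: {num_gpu_blocks}")  # ~5896 here vs 5899 in the log; the gap is
                                           # rounding in the GiB figures the log prints

max_model_len = 65536
print(f"Max concurrency: {num_gpu_blocks * 16 / max_model_len:.2f}x")  # ~1.44x
```

The ~1.44x figure matches the "Maximum concurrency for 65536 tokens per request" line: it is simply the total number of KV-cache token slots divided by max_model_len.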
CC: @DarkLight1337
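For anyone who wants to reproduce this outside the OpenAI-compatible server, a minimal Python-level sketch follows. The original runs used `vllm serve` (see the args dump above), so the model path and non-default settings here are copied from that dump; toggling the engine via the VLLM_USE_V1 environment variable is assumed to behave the same as with the CLI.

```python
import os

# Toggle between the V1 and V0 engine; set to "0" to reproduce the working V0 run.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

# Same model path and the settings that differ from defaults in the args dump above.
llm = LLM(
    model="models/llama3.1-8B-INT8",
    max_model_len=65536,
    gpu_memory_utilization=0.9,
    seed=0,
)
print(llm.generate("Hello"))
```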

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.