vLLM api server patching #152

Open
KeremTurgutlu opened this issue Feb 21, 2025 · 3 comments

@KeremTurgutlu (Contributor) commented Feb 21, 2025

Congrats on the vLLM update!

The current example shows how to run the gemlite backend with the LLM class by applying the patch in the same process. However, this approach doesn't work if a user wants to run vLLM's OpenAI-compatible API server with the MQLLMEngine, which is generally more suitable for production loads.
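
For reference, the in-process approach from the current example looks roughly like this (a minimal sketch: the backend-setting calls are taken from the patch further down, the model path is just a placeholder, and the backend is assumed to be set before the engine is constructed):

# In-process patching: works because the LLM class runs in this same process.
from hqq.utils.vllm import set_vllm_hqq_backend, VLLM_HQQ_BACKEND
set_vllm_hqq_backend(backend=VLLM_HQQ_BACKEND.GEMLITE)

from vllm import LLM
llm = LLM(model="path/to/hqq-quantized-model")  # placeholder model path
out = llm.generate(["Hello!"])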

To use that engine with the OpenAI API server, we need to patch vLLM's engine.py directly. The reason is that vLLM uses the spawn method to create a child process here, so any patching done in the parent process does not carry over.
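
The effect of spawn can be seen with a toy example (a self-contained sketch, unrelated to vLLM's actual code): state patched in the parent under the __main__ guard is not visible in the spawned child, because the child re-imports the module from scratch.

import multiprocessing as mp

PATCHED = False  # stands in for the hqq backend patch

def child():
    # The spawned child re-imports this module, so it only sees the default value.
    print("child sees PATCHED =", PATCHED)

if __name__ == "__main__":
    PATCHED = True  # "patch" applied in the parent process only
    p = mp.get_context("spawn").Process(target=child)
    p.start()
    p.join()  # prints: child sees PATCHED = False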

Here is my simple script to apply the patch to engine.py. I am not sure how you would like to incorporate this, but I wanted to share it.

import sys
import re
from pathlib import Path

# Find the vllm package location on sys.path
vllm_location = None
for p in sys.path:
    engine_path = Path(p) / "vllm/engine/multiprocessing/engine.py"
    if engine_path.exists():
        vllm_location = engine_path
        break
if vllm_location is None:
    raise Exception("Could not find vllm engine.py")

content = vllm_location.read_text()
# Skip if the file has already been patched (avoids injecting the patch twice)
if "set_vllm_hqq_backend" in content:
    raise Exception(f"{vllm_location} appears to already be patched")

patch = """
from hqq.utils.vllm import set_vllm_hqq_backend, VLLM_HQQ_BACKEND
set_vllm_hqq_backend(backend=VLLM_HQQ_BACKEND.GEMLITE)

from gemlite.triton_kernels.config import set_autotune
autotune_dict = dict(
    GEMV = False,
    GEMV_REVSPLITK = False,
    GEMV_SPLITK    = False,
    GEMM_SPLITK    = False,
    GEMM           = False,
    EXHAUSTIVE     = False,
    USE_CUDA_GRAPH = False
)
set_autotune(autotune_dict)
"""

# Find the last import statement in engine.py
import_matches = list(re.finditer(r'^(?:from|import)\s+.*$', content, re.MULTILINE))
if not import_matches:
    raise Exception("No import statements found")
last_import_pos = import_matches[-1].end()

# Insert the patch right after the last import
new_content = content[:last_import_pos] + "\n" + patch + content[last_import_pos:]

# Back up the original engine.py before overwriting it
backup_path = vllm_location.parent / "engine.py.bak"
if not backup_path.exists():
    vllm_location.rename(backup_path)

# Write the patched version in place of the original
vllm_location.write_text(new_content)

print(f"Patched {vllm_location}")
print(f"Backup saved to {backup_path}")

@mobicham (Collaborator) commented Feb 21, 2025

Thanks Kerem! We internally use vLLM with Ray via the LLM class, but this could indeed be useful for people using the OpenAI API server, unless they do it manually in engine.py!
Maybe we could put it in examples/vllm_openaiserver.py or something?

@KeremTurgutlu (Contributor, Author)

Sounds good. If you are fine with adding it as an example, I can do that, and maybe we can add a little note in the README for those who want to use the API server in vLLM. By the way, is there a reason you prefer Ray? The native vLLM FastAPI server has been working fine for us so far, but I would love to learn more about Ray's advantages. Thanks!

@mobicham (Collaborator)

Sounds good to me, feel free to do a PR!

It's because we support different backends, not just vLLM, since we also need to run other non-LLM models.
Our SDK code is open source, by the way: https://github.com/mobiusml/aana_sdk
