vLLM api server patching #152

Open
KeremTurgutlu opened this issue Feb 21, 2025 · 3 comments

@KeremTurgutlu (Contributor) commented Feb 21, 2025

Congrats on the vLLM update!

The current example shows how to run the gemlite backend with the LLM class by applying the patch in the same process. However, this approach doesn't work if a user wants to run vLLM's OpenAI-compatible API server with the MQLLMEngine, which is generally more suitable for production loads.
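
For reference, the in-process approach from the current example looks roughly like this (a minimal sketch: the backend-setting calls are taken from the patch further down, the model path is just a placeholder, and the backend is assumed to be set before the engine is constructed):

# In-process patching: works because the LLM class runs in this same process.
from hqq.utils.vllm import set_vllm_hqq_backend, VLLM_HQQ_BACKEND
set_vllm_hqq_backend(backend=VLLM_HQQ_BACKEND.GEMLITE)

from vllm import LLM
llm = LLM(model="path/to/hqq-quantized-model")  # placeholder model path
out = llm.generate(["Hello!"])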

To use that engine with the OpenAI API server, we need to patch vLLM's engine.py directly. The reason is that vLLM uses the spawn method to create a child process here, so any patching done in the parent process does not carry over.
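
The effect of spawn can be seen with a toy example (a self-contained sketch, unrelated to vLLM's actual code): state patched in the parent under the __main__ guard is not visible in the spawned child, because the child re-imports the module from scratch.

import multiprocessing as mp

PATCHED = False  # stands in for the hqq backend patch

def child():
    # The spawned child re-imports this module, so it only sees the default value.
    print("child sees PATCHED =", PATCHED)

if __name__ == "__main__":
    PATCHED = True  # "patch" applied in the parent process only
    p = mp.get_context("spawn").Process(target=child)
    p.start()
    p.join()  # prints: child sees PATCHED = False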

Here is my simple script to apply the patch to engine.py. I am not sure how you would like to incorporate this, but I wanted to share it.

import sys
import re
from pathlib import Path

# Find the vllm package location on sys.path
vllm_location = None
for p in sys.path:
    engine_path = Path(p) / "vllm/engine/multiprocessing/engine.py"
    if engine_path.exists():
        vllm_location = engine_path
        break
if vllm_location is None:
    raise Exception("Could not find vllm engine.py")

content = vllm_location.read_text()
# Skip if the file has already been patched (avoids injecting the patch twice)
if "set_vllm_hqq_backend" in content:
    raise Exception(f"{vllm_location} appears to already be patched")

patch = """
from hqq.utils.vllm import set_vllm_hqq_backend, VLLM_HQQ_BACKEND
set_vllm_hqq_backend(backend=VLLM_HQQ_BACKEND.GEMLITE)

from gemlite.triton_kernels.config import set_autotune
autotune_dict = dict(
    GEMV = False,
    GEMV_REVSPLITK = False,
    GEMV_SPLITK    = False,
    GEMM_SPLITK    = False,
    GEMM           = False,
    EXHAUSTIVE     = False,
    USE_CUDA_GRAPH = False
)
set_autotune(autotune_dict)
"""

# Find the last import statement in engine.py
import_matches = list(re.finditer(r'^(?:from|import)\s+.*$', content, re.MULTILINE))
if not import_matches:
    raise Exception("No import statements found")
last_import_pos = import_matches[-1].end()

# Insert the patch right after the last import
new_content = content[:last_import_pos] + "\n" + patch + content[last_import_pos:]

# Back up the original engine.py before overwriting it
backup_path = vllm_location.parent / "engine.py.bak"
if not backup_path.exists():
    vllm_location.rename(backup_path)

# Write the patched version in place of the original
vllm_location.write_text(new_content)

print(f"Patched {vllm_location}")
print(f"Backup saved to {backup_path}")

@mobicham (Collaborator) commented Feb 21, 2025

Thanks Kerem! We internally use vLLM with Ray via the LLM class, but this could indeed be useful for people using the OpenAI API server, unless they do it manually in engine.py!
Maybe we could put it in examples/vllm_openaiserver.py or something?

@KeremTurgutlu (Contributor, Author)

Sounds good. If you are fine with adding it as an example, I can do that, and maybe we can add a little note in the README for those who want to use the API server in vLLM. By the way, is there a reason you prefer Ray? The native vLLM FastAPI server has been working fine for us so far, but I would love to learn more about Ray's advantages. Thanks!

@mobicham (Collaborator)

Sounds good to me, feel free to do a PR!

It's because we support different backends, not just vLLM, since we also need to run other non-LLM models.
Our SDK code is open source, by the way: https://github.com/mobiusml/aana_sdk
