[Usage]: Logprobs Scaling with O(n) Complexity – Unexpected Performance Degradation #14300

Rachum-thu opened this issue Mar 5, 2025 · 0 comments

Rachum-thu commented Mar 5, 2025

Title: Logprobs Scaling with O(n) Complexity – Unexpected Performance Degradation

Description:
When increasing the logprobs parameter, I expected only a minor increase in runtime, since returning the top-k values should just be a slice of the full-vocabulary logits the model already computes. However, my experiments show runtime growing roughly linearly (O(n)) with the number of requested logprobs, which suggests that retrieving logprobs is far more computationally expensive than a simple top-k selection.
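
For reference, this is the kind of operation I expected to dominate. The standalone PyTorch sketch below (not vLLM code; it assumes a CUDA device and uses random logits of the same shape as one of my batches) just selects the top-k entries from logits that are already computed for the full vocabulary, which I would expect to stay in the millisecond range even when k equals the vocabulary size:

import time
import torch

# Standalone sketch (not vLLM code): top-k selection over logits that are
# already computed for the full vocabulary, on random data of the same shape.
batch_size, vocab_size = 32, 152_064
logits = torch.randn(batch_size, vocab_size, device="cuda")

for k in (10, 1_000, 100_000, vocab_size):
    torch.cuda.synchronize()
    start = time.time()
    values, indices = torch.topk(logits, k, dim=-1)  # k largest logits per sequence
    torch.cuda.synchronize()
    print(f"top-{k:>6}: {time.time() - start:.4f} s")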

Reproduction Code

import time
from vllm import LLM
from vllm.sampling_params import SamplingParams

def test_generation_time(llm, logprobs_value, batch_size=32):
    sampling_params = SamplingParams(logprobs=logprobs_value, max_tokens=1)
    
    # Timed run
    start_time = time.time()
    output = llm.generate(["Tell me something about LLMs."] * batch_size,
                         sampling_params=sampling_params,
                         use_tqdm=False)
    end_time = time.time()
    
    return end_time - start_time

def main():
    print("Initializing model...")
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_logprobs=152_064)  # vocab size
    
    batch_size = 32
    logprobs_values = [10, 100, 1000, 10000, 100000, 152064]
    results = []
    
    print("\nStarting tests...")
    for logprobs in logprobs_values:
        time_taken = test_generation_time(llm, logprobs, batch_size)
        results.append((logprobs, time_taken))
    
    print("\nResults Summary:")
    print("╔══════════════╦═══════════════╗")
    print("║   Logprobs   ║  Time (secs)  ║")
    print("╠══════════════╬═══════════════╣")
    for logprobs, time_taken in results:
        print(f"║ {logprobs:^12} ║ {time_taken:^13.4f} ║")
    print("╚══════════════╩═══════════════╝")

if __name__ == "__main__":
    main()

Observed Results

╔══════════════╦═══════════════╗
║   Logprobs   ║  Time (secs)  ║
╠══════════════╬═══════════════╣
║      10      ║    0.0784     ║
║     100      ║    0.0410     ║
║     1000     ║    0.1909     ║
║    10000     ║    1.9388     ║
║    100000    ║    19.9256    ║
║    152064    ║    29.2862    ║
╚══════════════╩═══════════════╝

Expected Behavior

Since the model inherently computes full logits for the vocabulary on every forward pass, I expected retrieving logprobs to involve only a minor computational overhead (e.g., sorting/selecting top-k). However, the results suggest that requesting more logprobs significantly increases runtime, implying an O(n) complexity scaling instead of an efficient selection from precomputed logits.
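
To make the scaling concrete: from the table above, the marginal cost per additional requested logprob is roughly constant across the larger runs, which is exactly what linear scaling predicts. A quick check on the measured numbers:

# Marginal cost per additional requested logprob, from the results table above.
results = [(10_000, 1.9388), (100_000, 19.9256), (152_064, 29.2862)]
for (n0, t0), (n1, t1) in zip(results, results[1:]):
    print(f"{n0:>7} -> {n1:>7}: {(t1 - t0) / (n1 - n0) * 1e3:.2f} ms per extra logprob")
# prints ~0.20 ms and ~0.18 ms per extra logprob (for the whole batch of 32)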

Questions:

  1. Why does increasing logprobs scale in an O(n) fashion?
    • Is the model recomputing or performing expensive operations instead of just slicing logits?
  2. Is there a way to retrieve logprobs for the full vocabulary without incurring this high runtime penalty?
  3. Would it be possible to expose full logits instead of just logprobs? (A rough sketch of what I mean, using plain Hugging Face transformers, is below.)
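
Regarding question 3, this is roughly what I mean by exposing full logits. The sketch below sits outside vLLM entirely (plain Hugging Face transformers, so without vLLM's batching and throughput) and only illustrates that a single forward pass already yields raw logits for the whole vocabulary:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch outside vLLM: one forward pass exposes raw logits for the whole vocabulary.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Tell me something about LLMs.", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]           # [1, vocab_size] raw next-token logits
logprobs = torch.log_softmax(logits.float(), dim=-1)    # full-vocabulary logprobs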

System Info:

  • vLLM version: 0.7.4.dev142+g9804145c.d20250228
  • Model: Qwen/Qwen2.5-7B-Instruct
  • CUDA Version: 12.5

Looking forward to insights on whether this is expected behavior or a possible optimization opportunity! Thanks!

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.