[Core] generate from input embeds #6869

Open
Nan2018 wants to merge 126 commits into main from feature-input-embeds

Conversation


@Nan2018 Nan2018 commented Jul 27, 2024

adds support for passing prompt_embeds to LLM.generate as

llm.generate({"prompt_embeds": input_embeds}, sampling_params)

or

llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)

This enables use cases where only the embedding layer is fine-tuned, letting the same model backend serve multiple custom-tuned embedding layers.
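
For illustration, here is a minimal sketch of how the prompt embeddings might be produced and passed in, assuming the API proposed in this PR; the model name, the use of a HF embedding layer, and the sampling settings are placeholder choices:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

# Placeholder model; in practice embedding_layer could be a custom fine-tuned
# embedding layer with the same hidden size as the vLLM model.
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)
embedding_layer = hf_model.get_input_embeddings()

token_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = embedding_layer(token_ids).squeeze(0)  # [seq_len, hidden_size]

llm = LLM(model=model_name)
sampling_params = SamplingParams(max_tokens=32)
# Uses the interface added by this PR; not available in released vLLM.
outputs = llm.generate({"prompt_embeds": prompt_embeds}, sampling_params)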

FIX #416
FIX #8323

inspired by #1265 which is very outdated


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@Nan2018
Author

Nan2018 commented Aug 8, 2024

@WoosukKwon @ywang96 @robertgshaw2-neuralmagic

The failed tests with
ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
seem unrelated to my changes, and I can't reproduce them locally.

Other than that, this is ready for review.

@ywang96
Member

ywang96 commented Aug 8, 2024

> @WoosukKwon @ywang96 @robertgshaw2-neuralmagic
>
> The failed tests with ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating seem unrelated to my changes, and I can't reproduce them locally.
>
> Other than that, this is ready for review.

This is due to a recent change in transformers deprecating the default chat template, and it should have been fixed by #7238. Can you merge your branch with main again?

@Nan2018 Nan2018 force-pushed the feature-input-embeds branch from 3f79009 to dfd9301 on September 4, 2024 22:27
@DarkLight1337
Member

> @DarkLight1337 It seems that this feature is almost ready to be used?

It is incompatible with the recent changes to vLLM internals. I'm waiting for the dust to settle for V2 engine before picking this back up.

@lzl-mt

lzl-mt commented Nov 12, 2024

> @DarkLight1337 It seems that this feature is almost ready to be used?
>
> It is incompatible with the recent changes to vLLM internals. I'm waiting for the dust to settle for V2 engine before picking this back up.

Can I run it directly using your branch?

@DarkLight1337
Member

> Can I run it directly using your branch?

No, the current branch is broken because I have already merged in those changes from main.

@DarkLight1337
Member

I think the latest working commit is 49fe3f7

@lzl-mt

lzl-mt commented Nov 12, 2024

> Can I run it directly using your branch?
>
> No, the current branch is broken because I have already merged in those changes from main.

Umm.. Is there a working commit I can reset back to?

@Nan2018
Author

Nan2018 commented Nov 14, 2024

Hmm, I see. I'm quite against changing the semantics of forward to only work on embedded inputs, though, since that would confuse developers who are used to working with HF models. Could we instead selectively torch.compile individual methods? (Not sure why the current decorator is hardcoded to compile the forward method.)

Perhaps an option is to move the embedding logic to ModelForCausalLM.forward and compile the Model classes instead of the ModelForCausalLM classes?
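
For concreteness, here is a toy sketch (not vLLM code; the class and attribute names are illustrative) of that split: the inner Model consumes embeddings only and is the unit that gets torch.compile'd, while the outer ModelForCausalLM keeps the familiar forward semantics and owns the token-ID/embedding branching:

import torch
import torch.nn as nn
from typing import Optional

class Model(nn.Module):
    # Inner transformer stack: operates purely on embeddings, so the compiled
    # graph never contains the token-ID/embedding branching.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(2))

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        hidden = inputs_embeds
        for layer in self.layers:
            hidden = torch.relu(layer(hidden))
        return hidden

class ModelForCausalLM(nn.Module):
    # Outer wrapper: resolves input_ids vs. inputs_embeds, then calls the
    # compiled inner model.
    def __init__(self, vocab_size: int = 1000, hidden_size: int = 64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.model = torch.compile(Model(hidden_size))
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids: Optional[torch.Tensor] = None,
                inputs_embeds: Optional[torch.Tensor] = None) -> torch.Tensor:
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        return self.lm_head(self.model(inputs_embeds))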

@toilaluan

@DarkLight1337 Do you guys have an ETA for this feature?

@DarkLight1337
Member

> @DarkLight1337 Do you guys have an ETA for this feature?

No ETA yet; it depends on when the V2 re-arch is done.

@DarkLight1337
Member

I think that this PR will eventually be superseded by the follow-up work to #10374, which should make it easy to support embedding inputs.

@DaoD

DaoD commented Nov 25, 2024

Hi there, is there any update on this PR?

@DarkLight1337
Member

> Hi there, is there any update on this PR?

Please read the above comment.


mergify bot commented Dec 11, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Nan2018.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 11, 2024
@CandiedCode

@DarkLight1337 since #10374 has landed, is this PR going to be updated so that this work can be finished, or is there a new PR you can reference for tracking?

@DarkLight1337
Member

DarkLight1337 commented Dec 19, 2024

> @DarkLight1337 since #10374 has landed, is this PR going to be updated so that this work can be finished, or is there a new PR you can reference for tracking?

Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

@CandiedCode

CandiedCode commented Dec 19, 2024

> Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

Thanks for the reference, @DarkLight1337. Is it possible to add this to the roadmap for additional visibility as well?

@ywang96
Member

ywang96 commented Dec 19, 2024

> Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

> Thanks for the reference, @DarkLight1337. Is it possible to add this to the roadmap for additional visibility as well?

Hey @CandiedCode! Thanks for following up on this.

IMO, supporting embeddings as input is not technically difficult as is, but we do want to be careful with the design so that it works with all the other features we want to natively support in vLLM, especially now that we're going through the re-architecture. I have briefly discussed this with @WoosukKwon in #11032 (comment).

In particular, some issues we still need to design and address:

  • What happens if a batch has both token ids as input and embeddings as input?
  • Prefix caching (currently we use token IDs as the hash key; a rough illustration follows this list)
  • Spec decode (we assume draft models output token IDs to be accepted by the main model)
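
On the prefix-caching point, here is a purely illustrative sketch (not vLLM's implementation) of how a block hash could be derived from embedding content instead of token IDs, chained with the previous block's hash; a real design would also need to handle floating-point jitter (e.g. quantization or a caller-supplied cache key) for hits to occur:

import hashlib
import torch

def embedding_block_hash(prev_hash: bytes, block_embeds: torch.Tensor) -> bytes:
    # Hypothetical: hash the raw bytes of one block of prompt embeddings together
    # with the hash of the preceding block, mirroring how token-ID block hashes
    # are chained for prefix caching.
    h = hashlib.sha256()
    h.update(prev_hash)
    h.update(block_embeds.detach().to(torch.float32).cpu().contiguous().numpy().tobytes())
    return h.digest()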

@HaiFengZeng

Using the code below for inference, with prompt_embeds.size() = torch.Size([85, 896]):

for tk in llm.generate({'prompt_embeds':prompt_input},sample_params):
    print(tk)

and it raises the exception below:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/AIOT-vePFS/speech/workspace/zenghaifeng/sllm/CosyVoice2-0.5B/vllm-test.py", line 17, in <module>
[rank0]:     for tk in llm.generate({'prompt_embeds':prompt_input},sample_params):
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/utils.py", line 1063, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 396, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 931, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1415, in step
[rank0]:     ) = self.scheduler[virtual_engine].schedule()
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1219, in schedule
[rank0]:     scheduler_outputs: SchedulerOutputs = self._schedule()
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1178, in _schedule
[rank0]:     return self._schedule_default()
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1013, in _schedule_default
[rank0]:     prefills = self._schedule_prefills(budget,
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 885, in _schedule_prefills
[rank0]:     num_new_tokens = self._get_num_new_tokens(seq_group,
[rank0]:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1605, in _get_num_new_tokens
[rank0]:     assert num_new_tokens > 0
[rank0]: AssertionError

How can I solve this?

@DarkLight1337
Member

> I think the latest working commit is 49fe3f7

Please revert to the above commit. The current vLLM is not compatible with input embeds yet.

@serser

serser commented Jan 2, 2025

It seems the length of a prompt_embeds sequence is always 0 due to get_len, which leaves the scheduler with nothing to schedule, hence the deadlock here. Am I right?

@DarkLight1337
Member

> It seems the length of a prompt_embeds sequence is always 0 due to get_len, which leaves the scheduler with nothing to schedule, hence the deadlock here. Am I right?

Yeah, that is the problem.
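
To make the failure mode concrete, here is a hypothetical sketch (the class and field names are illustrative, not vLLM's actual sequence data): if the sequence length is computed from token IDs alone, an embeds-only prompt reports length 0 and the scheduler never admits the prefill; counting embedding rows as the prompt length is one possible fix:

from dataclasses import dataclass, field
from typing import List, Optional
import torch

@dataclass
class SequenceDataSketch:
    prompt_token_ids: List[int] = field(default_factory=list)
    output_token_ids: List[int] = field(default_factory=list)
    prompt_embeds: Optional[torch.Tensor] = None  # [num_prompt_tokens, hidden_size]

    def get_prompt_len(self) -> int:
        if self.prompt_token_ids:
            return len(self.prompt_token_ids)
        # Embeds-only prompt: count embedding rows instead of token IDs;
        # otherwise this is 0 and the scheduler has nothing to schedule.
        return self.prompt_embeds.shape[0] if self.prompt_embeds is not None else 0

    def get_len(self) -> int:
        return self.get_prompt_len() + len(self.output_token_ids)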

@solume

solume commented Feb 5, 2025

Is there a version of this that currently works for input_embeds + prompt for a text-only model?

@DarkLight1337
Member

> Is there a version of this that currently works for input_embeds + prompt for a text-only model?

No, can you elaborate on the use case of mixing these two? vLLM already does prefix caching so this shouldn't be necessary.

@solume

solume commented Feb 5, 2025

I'm using an embedding that comes externally as context that I need to prepend to the text. So I have a context vector the size of the hidden dimension of the LLM, and a prompt text. I have a patched vllm where I pass down an input_embeds array to the model (I'm using llama.py), and then I'm adding the prefix in the forward pass like this:


    def forward(
        ...
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...
        if inputs_embeds is not None and input_ids is not None:
            # External context embeddings supplied: embed the prompt tokens
            # and prepend the context vector(s) to them.
            text_embeds = self.get_input_embeddings(input_ids)
            inputs_embeds = torch.cat([inputs_embeds, text_embeds], dim=1)
            input_ids = None
        elif inputs_embeds is None:
            # No external embeddings: fall back to the usual token-ID path.
            inputs_embeds = self.get_input_embeddings(input_ids)
            input_ids = None

        # Run the transformer stack on embeddings only (input_ids is now None).
        model_output = self.model(
            input_ids,
            positions,
            kv_caches,
            attn_metadata,
            intermediate_tensors,
            inputs_embeds,
        )

So I was wondering if there is a way of doing this without such a patch.

@DarkLight1337
Member

DarkLight1337 commented Feb 6, 2025

I think you can use prompt adapter request (#4645) for the prepended embeddings. It's going to be deprecated in V1 though.

I suggest that you simply pass the context alongside the prompt in each request. There should be little overhead for doing so when prefix caching is enabled.
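
For example, a minimal sketch of that suggestion (the model name and texts are placeholders); with enable_prefix_caching the shared context prefix is computed once and reused across requests:

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# The context that was previously injected as an embedding, now passed as text.
shared_context = "<context text>"
questions = ["First question?", "Second question?"]

prompts = [f"{shared_context}\n\n{q}" for q in questions]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))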

@CandiedCode

CandiedCode commented Feb 10, 2025

Hi @ywang96 and @DarkLight1337

Last time you guys mentioned that the following still needed to be designed and addressed:

  • What happens if a batch has both token ids as input and embeddings as input?
  • Prefix caching (Currently we use token ids as hash key)
  • Spec decode (we assume draft models to output token id to be accepted by main model)

I wanted to check to see if there had been any progress made on these efforts.

Thanks for all that you guys do!

~CandiedCode

@DarkLight1337
Member

> Hi @ywang96 and @DarkLight1337
>
> Last time you guys mentioned that the following still needed to be designed and addressed:
>
>   • What happens if a batch has both token ids as input and embeddings as input?
>   • Prefix caching (Currently we use token ids as hash key)
>   • Spec decode (we assume draft models to output token id to be accepted by main model)
>
> I wanted to check to see if there had been any progress made on these efforts.
>
> Thanks for all that you guys do!
>
> ~CandiedCode

To avoid complicating the development of V1, we haven't really touched this yet. We'll consider this once it is more stable.

@amuvarma13

Any progress on this?
