[Core] generate from input embeds #6869
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
@WoosukKwon @ywang96 @robertgshaw2-neuralmagic aside from the failed tests, this is ready for review |
This is due to a recent change in transformers that deprecated the default chat template; it should have been fixed by #7238. Can you merge your branch with main again? |
(Branch force-pushed from 3f79009 to dfd9301)
It is incompatible with the recent changes to vLLM internals. I'm waiting for the dust to settle for V2 engine before picking this back up. |
Can I run it directly using your branch? |
No, the current branch is broken because I have already merged in those changes from main. |
Umm.. Is there a working commit I can reset back to? |
I think the latest working commit is 49fe3f7 |
Perhaps an option is to move the embedding logic to |
@DarkLight1337 Do you guys have an ETA for this feature? |
No ETA yet; we'll see once the V2 re-arch is done. |
I think that this PR will eventually be superseded by the follow-up work to #10374 which should make it easy to support embedding inputs. |
Hi there, is there any update on this PR? |
Please read the above comment. |
This pull request has merge conflicts that must be resolved before it can be merged. |
@DarkLight1337 since #10374 has landed, is this PR going to be updated so that this work can be finished, or is there a new PR you can reference for tracking? |
Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag. |
Thanks for the reference, @DarkLight1337 . Is it possible also to add this to the roadmap for additional visibility as well? |
Hey @CandiedCode! Thanks for following up on this. IMO supporting embeddings as input is not technically difficult in itself, but we do want to be careful with the design so that it works with all the other features we want to natively support in vLLM, especially now that we're going through the re-architecture. I have discussed this briefly with @WoosukKwon in #11032 (comment). In particular, some issues we still need to design and address:
|
I'm using the code for inference and it raises an exception as below. How do I solve this? |
Please revert to the above commit. The current vLLM is not compatible with input embeds yet. |
is there a version of this that currently works for input_embed + prompt for a text-only model? |
No, can you elaborate on the use case of mixing these two? vLLM already does prefix caching so this shouldn't be necessary. |
I'm using an embedding that comes externally as context, which I need to prepend to the text. So I have a context vector the size of the hidden dimension of the LLM, and a prompt text. I have a patched vLLM where I pass down an input_embeds array to the model (I'm using llama.py), and then I add the prefix in the forward pass like this:
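(The original snippet was not captured above; the following is only a hypothetical sketch of the idea described, prepending a context vector of size hidden_size to the token embeddings before the decoder layers, with made-up function and variable names.)

```python
import torch

def forward_with_context(embed_tokens, decoder_layers, input_ids, context_vec):
    """Hypothetical sketch: prepend an externally computed context embedding
    to the token embeddings before running the decoder layers.

    embed_tokens:   the model's token embedding module
    decoder_layers: a callable running the remainder of the forward pass
    input_ids:      [batch, seq_len] token ids of the text prompt
    context_vec:    [batch, hidden_size] external context embedding
    """
    inputs_embeds = embed_tokens(input_ids)      # [batch, seq_len, hidden]
    context = context_vec.unsqueeze(1)           # [batch, 1, hidden]
    # Treat the context vector as one extra "virtual token" at the front.
    inputs_embeds = torch.cat([context, inputs_embeds], dim=1)
    return decoder_layers(inputs_embeds)
```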
so was wondering if there is a way of doing this without such a patch. |
I think you can use the prompt adapter request (#4645) for the prepended embeddings. It's going to be deprecated in V1, though. I suggest that you simply pass the context alongside the prompt in each request. There should be little overhead in doing so when prefix caching is enabled. |
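A minimal sketch of that suggestion, passing the shared context as plain text in every request and relying on prefix caching (the model name and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

# Prefix caching lets the shared context prefix be computed once and reused.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

context = "<shared context that would otherwise be a prepended embedding>"
prompts = [context + "\n" + q for q in ["Question 1?", "Question 2?"]]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```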
Hi @ywang96 and @DarkLight1337 Last time you mentioned that the following still needed to be designed and addressed:
I wanted to check to see if there had been any progress made on these efforts. Thanks for all that you guys do! ~CandiedCode |
To avoid complicating the development of V1, we haven't really touched this yet. We'll consider this once it is more stable. |
Any progress on this? |
adds support for passing prompt_embeds to LLM.generate
this enables use cases where only the embedding layer is fine-tuned, so the same model backend can serve multiple custom-tuned embedding layers
FIX #416
FIX #8323
inspired by #1265, which is very outdated
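A rough usage sketch of what the PR description implies; the exact input format (a dict with a prompt_embeds key) and the model name are assumptions, not the merged interface:

```python
# Hypothetical sketch of generating from precomputed prompt embeddings.
# The {"prompt_embeds": ...} input format is an assumption based on the PR
# description, not a confirmed vLLM interface.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

# Embeddings that might come from a custom fine-tuned embedding layer.
hidden_size = 4096
prompt_embeds = torch.randn(16, hidden_size, dtype=torch.float16)  # [seq_len, hidden]

outputs = llm.generate({"prompt_embeds": prompt_embeds}, SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```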