[Core] Support pooling #229

Merged
merged 1 commit into vllm-project:main from pooling on Mar 4, 2025
Conversation

@wangxiyuan (Collaborator) commented Mar 3, 2025

This PR adds pooling support for vllm-ascend.

Tested with `bge-base-en-v1.5` via `encode`:

```python
from vllm import LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
# Generate embeddings. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 768 floats (bge-base hidden size)
```

Tested via `embed`:

```python
from vllm import LLM

llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

Related: #200 #235

Known issue

The accuracy is not correct yet, since this feature relies on enc-dec support. That will be addressed in a follow-up PR by @MengqingCao.
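
One way to quantify the accuracy gap as the enc-dec work lands is to compare against a reference implementation. An illustrative sketch, not part of this PR; it assumes `sentence-transformers` is installed and uses it purely as the reference:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from vllm import LLM

prompt = "Hello, my name is"

# Reference embedding from sentence-transformers.
ref = SentenceTransformer("./bge-base-en-v1.5").encode(prompt)

# Embedding from vllm-ascend's pooling path.
llm = LLM(model="./bge-base-en-v1.5", task="embed", enforce_eager=True)
(output,) = llm.embed(prompt)
out = np.asarray(output.outputs.embedding)

# Cosine similarity should approach 1.0 once the accuracy issue is fixed.
cos = float(ref @ out) / (np.linalg.norm(ref) * np.linalg.norm(out))
print(f"cosine similarity vs. reference: {cos:.4f}")
```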

@MengqingCao (Contributor)

Tested `LLM.score` with `BAAI/bge-reranker-v2-m3` locally, and it raised `NotImplementedError`:

```
[rank0]:   File "/home/xxx/code/vllm-cpu/vllm/vllm/attention/layer.py", line 220, in forward
[rank0]:     return self.impl.forward(self, query, key, value,
[rank0]:   File "/home/xxx/code/vllm-ascend/vllm_ascend/attention.py", line 546, in forward
[rank0]:     raise NotImplementedError("Encoder self-attention and "
[rank0]: NotImplementedError: Encoder self-attention and encoder/decoder cross-attention are not implemented for AscendAttentionBackendImpl
```

It seems we need to add support for encoder self-attention and encoder/decoder cross-attention. But I'm fine with this PR as initial pooling-model support; we can do the rest of the work in follow-up PRs.
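
For reference, a minimal sketch of the failing call (the scoring usage below follows upstream vLLM's `LLM.score` example and is an assumption, not verified on this branch):

```python
from vllm import LLM

# Cross-encoder scoring. On vllm-ascend this currently raises
# NotImplementedError in AscendAttentionBackendImpl, because encoder
# self-attention is not implemented yet.
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")

(output,) = llm.score("What is the capital of France?",
                      "The capital of France is Paris.")
print(output.outputs.score)
```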

@wangxiyuan (Collaborator, Author)

> It seems we need to add support for encoder self-attention and encoder/decoder cross-attention. But I'm fine with this PR as initial pooling-model support; we can do the rest of the work in follow-up PRs.

`BAAI/bge-reranker-v2-m3` relies on encoder-only attention, which is not supported yet. I think we can handle it as part of the enc-dec feature.

@wangxiyuan force-pushed the pooling branch 3 times, most recently from e50e772 to e682021 on March 4, 2025 03:03
Signed-off-by: wangxiyuan <[email protected]>
@github-actions bot added the `documentation` label Mar 4, 2025
@wangxiyuan merged commit ae49bfd into vllm-project:main Mar 4, 2025
11 checks passed
wangxiyuan added a commit to wangxiyuan/vllm-ascend that referenced this pull request Mar 4, 2025
This PR adds pooling support for vllm-ascend.

Tested with `bge-base-en-v1.5` via `encode`:
```
from vllm import LLM

prompts = [
  "Hello, my name is",
  "The president of the United States is",
  "The capital of France is",
  "The future of AI is",
]
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
outputs = model.encode(prompts)
for output in outputs:
    print(output.outputs.embedding)  # list of 768 floats (bge-base hidden size)
```

Tested via `embed`:
```
from vllm import LLM

llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

Related: vllm-project#200

The accuracy is not correct yet, since this feature relies on `enc-dec`
support. That will be addressed in a follow-up PR by @MengqingCao.

Signed-off-by: wangxiyuan <[email protected]>
wangxiyuan added a commit to wangxiyuan/vllm-ascend that referenced this pull request Mar 4, 2025
@wangxiyuan deleted the pooling branch March 4, 2025 08:59