[Model] add colqwen2_vl code & inference #14291
Open
+470
−20
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Add support for ColQwen2VL model
Description
This PR adds support for the ColQwen2VL model to vLLM. ColQwen2VL is an efficient document retrieval vision language model based on Qwen2VL, as described in the paper "ColPali: Efficient Document Retrieval with Vision Language Models". The model is designed to generate embeddings rather than text outputs, making it suitable for document retrieval applications.
Key implementation details:
Extended the existing Qwen2VL implementation for ColQwen2VL compatibility
Implemented custom text projection layer and L2 normalization for embedding generation
Added appropriate processing utilities for image and video inputs
Overrode forward, compute_logits and sample methods to optimize for embedding output
This implementation enables users to leverage ColQwen2VL's multimodal document retrieval capabilities through vLLM's efficient serving infrastructure.
Testing
Tested with sample image inputs
Verified embedding output format and dimensions
Confirmed compatibility with HuggingFace ColQwen2VL models