
[New Model]: Request for LLaVA-Video-7B-Qwen2 Model Implementation #13190

Closed · 1 task done
Noctis-SC opened this issue Feb 13, 2025 · 4 comments

Labels: new model

Comments

@Noctis-SC

The model to consider.

Hello, I would like to request that this model (LLaVA-Video-7B-Qwen2) be converted to HF weights.

I’ve been testing llava-hf/LLaVA-NeXT-Video-7B-32K-hf (Vicuna base) and LLaVA-Video-7B-Qwen2 (Qwen2 base), and I’ve noticed that LLaVA-Video-7B-Qwen2 outperforms LLaVA-NeXT-Video-7B-32K-hf in video understanding (video description). Given the performance difference, I would like to request support for LLaVA-Video-7B-Qwen2 as a new model.

Thanks!

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Noctis-SC added the new model label on Feb 13, 2025
@Isotr0py (Collaborator) commented Feb 18, 2025

@Noctis-SC It seems that this model has the same architecture as llava-onevision, so you can convert it to an HF-format llava-onevision model using this modified script: convert_llava_onevision_weights_to_hf.py

I will try to open a PR in the transformers repo to update their llava-onevision conversion script to support LLaVA-Video models.
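
For anyone following along, a quick sanity check after the conversion finishes is to load the resulting checkpoint with the stock llava-onevision classes in transformers. This is a minimal sketch, not the conversion script itself; the class names are real transformers APIs, while the local path is a placeholder for whatever output directory you used.

    # Hedged sketch: verify the converted checkpoint loads as a llava-onevision model.
    # "model_path" is a placeholder for the conversion script's output directory.
    import torch
    from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

    model_path = "/path/to/LLaVA-Video-7B-Qwen2-hf"
    processor = AutoProcessor.from_pretrained(model_path)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    print(type(model).__name__)  # expect LlavaOnevisionForConditionalGeneration

If this loads cleanly, the checkpoint should also be usable anywhere llava-onevision is already supported.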

@Noctis-SC (Author) commented

Hey, thank you for your reply. I've tried to follow your script to convert the model, but it has been stuck on the "Single forward pass" step for a couple of hours. Is that how it is supposed to be?

2025-02-24 14:28:19,108 - INFO - Starting conversion for model: lmms-lab/LLaVA-Video-7B-Qwen2
2025-02-24 14:28:19,109 - INFO - Memory usage: 491.24 MB
2025-02-24 14:28:19,109 - INFO - Loading original config...
{'_name_or_path': '/mnt/bn/vl-research/checkpoints/onevision/llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-7B-Instruct-mid_to_final_next_2p4m_am9', 'add_time_instruction': True, 'add_faster_video': False, 'architectures': ['LlavaQwenForCausalLM'], 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'force_sample': True, 'hidden_act': 'silu', 'hidden_size': 3584, 'image_aspect_ratio': 'anyres_max_9', 'image_crop_resolution': None, 'image_grid_pinpoints': [[384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304], [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304], [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304], [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304], [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304], [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]], 'image_split_resolution': None, 'initializer_range': 0.02, 'intermediate_size': 18944, 'max_position_embeddings': 32768, 'image_token_index': 151646, 'max_window_layers': 28, 'mm_hidden_size': 1152, 'mm_newline_position': 'grid', 'mm_patch_merge_type': 'spatial_unpad', 'mm_projector_lr': None, 'mm_projector_type': 'mlp2x_gelu', 'mm_resampler_type': None, 'mm_spatial_pool_mode': 'bilinear', 'mm_tunable_parts': 'mm_vision_tower,mm_mlp_adapter,mm_language_model', 'mm_use_im_patch_token': False, 'mm_use_im_start_end': False, 'mm_vision_select_feature': 'patch', 'mm_vision_select_layer': -2, 'mm_vision_tower': 'google/siglip-so400m-patch14-384', 'mm_vision_tower_lr': 2e-06, 'model_type': 'llava', 'num_attention_heads': 28, 'num_hidden_layers': 28, 'num_key_value_heads': 4, 'pos_skipping_range': 4096, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000.0, 'sliding_window': 131072, 'tie_word_embeddings': False, 'tokenizer_model_max_length': 32768, 'tokenizer_padding_side': 'right', 'torch_dtype': 'bfloat16', 'transformers_version': '4.40.0.dev0', 'use_cache': True, 'use_mm_proj': True, 'use_pos_skipping': False, 'use_sliding_window': False, 'vision_tower_pretrained': None, 'vocab_size': 152064}
2025-02-24 14:28:21,295 - INFO - Loading original state dict...
Fetching 4 files: 100%|██████████████████████████████████████████| 4/4 [00:00<00:00, 61680.94it/s]
2025-02-24 14:28:21,538 - INFO - Memory usage: 592.33 MB
2025-02-24 14:28:25,659 - INFO - Memory usage: 2189.43 MB
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
Saving model and processor for lmms-lab/LLaVA-Video-7B-Qwen2 to /root/autodl-tmp/sanya/LLava_Next_video_converted
2025-02-24 14:28:55,690 - INFO - Memory usage: 2233.65 MB
Loading checkpoint shards: 100%|████████████████████████████████████| 4/4 [00:00<00:00, 7.79it/s]
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
/root/autodl-tmp/sanya/transformers/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py:268: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  original_pixel_values = torch.load(filepath, map_location="cpu")
Single forward pass

@Isotr0py (Collaborator) commented

@Noctis-SC I have converted it and pushed it to the HF Hub. Can you check this model repo? (https://huggingface.co/Isotr0py/LLaVA-Video-7B-Qwen2-hf)
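
For reference, a minimal sketch of trying the converted repo with vLLM's offline LLM API (the repo id comes from the link above; max_model_len and the sampling settings are only illustrative, and a plain text prompt is used as a smoke test):

    # Minimal sketch: load the converted checkpoint with vLLM and run a text-only prompt.
    # Values like max_model_len, temperature and max_tokens are illustrative, not recommendations.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Isotr0py/LLaVA-Video-7B-Qwen2-hf", max_model_len=8192)
    params = SamplingParams(temperature=0.2, max_tokens=128)
    outputs = llm.generate("Briefly introduce yourself.", params)
    print(outputs[0].outputs[0].text)

Actual video inputs would additionally go through vLLM's multimodal input support, which is omitted here.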

@Noctis-SC (Author) commented

Thank you. I ran this model and it seems to work fine with vLLM. I will test it over the coming days, but for now I have no questions. Thanks again for the quick responses.
