
[New Model]: Request for LLaVA-Video-7B-Qwen2 Model Implementation #13190

Closed · 1 task done
Noctis-SC opened this issue Feb 13, 2025 · 4 comments

Labels: new model

Comments

@Noctis-SC

The model to consider.

Hello, I would like to request that this model (LLaVA-Video-7B-Qwen2) be converted to HF weights.

I’ve been testing llava-hf/LLaVA-NeXT-Video-7B-32K-hf (Vicuna base) and LLaVA-Video-7B-Qwen2 (Qwen2 base), and I’ve noticed that LLaVA-Video-7B-Qwen2 outperforms LLaVA-NeXT-Video-7B-32K-hf in video understanding (video description). Given the performance difference, I would like to request support for LLaVA-Video-7B-Qwen2 as a new model.

Thanks!

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Noctis-SC added the new model label on Feb 13, 2025
@Isotr0py (Collaborator) commented Feb 18, 2025

@Noctis-SC It seems that this model has the same architecture as llava-onevision, so you can convert it to an HF-format llava-onevision model using this modified script: convert_llava_onevision_weights_to_hf.py

I will try to open a PR in the transformers repo to update their llava-onevision conversion script to support LLaVA-Video models.
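
For anyone following along, a quick sanity check after the conversion finishes is to load the resulting checkpoint with the stock llava-onevision classes in transformers. This is a minimal sketch, not the conversion script itself; the class names are real transformers APIs, while the local path is a placeholder for whatever output directory you used.

    # Hedged sketch: verify the converted checkpoint loads as a llava-onevision model.
    # "model_path" is a placeholder for the conversion script's output directory.
    import torch
    from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

    model_path = "/path/to/LLaVA-Video-7B-Qwen2-hf"
    processor = AutoProcessor.from_pretrained(model_path)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    print(type(model).__name__)  # expect LlavaOnevisionForConditionalGeneration

If this loads cleanly, the checkpoint should also be usable anywhere llava-onevision is already supported.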

@Noctis-SC (Author) commented

Hey, thank you for your reply. I've tried to follow your script to convert the model, but it has been stuck on the "Single forward pass" step for a couple of hours. Is that how it is supposed to be?

2025-02-24 14:28:19,108 - INFO - Starting conversion for model: lmms-lab/LLaVA-Video-7B-Qwen2
2025-02-24 14:28:19,109 - INFO - Memory usage: 491.24 MB
2025-02-24 14:28:19,109 - INFO - Loading original config...
{'_name_or_path': '/mnt/bn/vl-research/checkpoints/onevision/llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-7B-Instruct-mid_to_final_next_2p4m_am9', 'add_time_instruction': True, 'add_faster_video': False, 'architectures': ['LlavaQwenForCausalLM'], 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'force_sample': True, 'hidden_act': 'silu', 'hidden_size': 3584, 'image_aspect_ratio': 'anyres_max_9', 'image_crop_resolution': None, 'image_grid_pinpoints': [[384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304], [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304], [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304], [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304], [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304], [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]], 'image_split_resolution': None, 'initializer_range': 0.02, 'intermediate_size': 18944, 'max_position_embeddings': 32768, 'image_token_index': 151646, 'max_window_layers': 28, 'mm_hidden_size': 1152, 'mm_newline_position': 'grid', 'mm_patch_merge_type': 'spatial_unpad', 'mm_projector_lr': None, 'mm_projector_type': 'mlp2x_gelu', 'mm_resampler_type': None, 'mm_spatial_pool_mode': 'bilinear', 'mm_tunable_parts': 'mm_vision_tower,mm_mlp_adapter,mm_language_model', 'mm_use_im_patch_token': False, 'mm_use_im_start_end': False, 'mm_vision_select_feature': 'patch', 'mm_vision_select_layer': -2, 'mm_vision_tower': 'google/siglip-so400m-patch14-384', 'mm_vision_tower_lr': 2e-06, 'model_type': 'llava', 'num_attention_heads': 28, 'num_hidden_layers': 28, 'num_key_value_heads': 4, 'pos_skipping_range': 4096, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000.0, 'sliding_window': 131072, 'tie_word_embeddings': False, 'tokenizer_model_max_length': 32768, 'tokenizer_padding_side': 'right', 'torch_dtype': 'bfloat16', 'transformers_version': '4.40.0.dev0', 'use_cache': True, 'use_mm_proj': True, 'use_pos_skipping': False, 'use_sliding_window': False, 'vision_tower_pretrained': None, 'vocab_size': 152064}
2025-02-24 14:28:21,295 - INFO - Loading original state dict...
Fetching 4 files: 100%|██████████████████████████████████████████| 4/4 [00:00<00:00, 61680.94it/s]
2025-02-24 14:28:21,538 - INFO - Memory usage: 592.33 MB
2025-02-24 14:28:25,659 - INFO - Memory usage: 2189.43 MB
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use mean_resizing=False
Saving model and processor for lmms-lab/LLaVA-Video-7B-Qwen2 to /root/autodl-tmp/sanya/LLava_Next_video_converted
2025-02-24 14:28:55,690 - INFO - Memory usage: 2233.65 MB
Loading checkpoint shards: 100%|████████████████████████████████████| 4/4 [00:00<00:00, 7.79it/s]
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
/root/autodl-tmp/sanya/transformers/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py:268: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  original_pixel_values = torch.load(filepath, map_location="cpu")
Single forward pass

@Isotr0py (Collaborator) commented

@Noctis-SC I have converted it and pushed it to the HF Hub. Can you check this model repo? (https://huggingface.co/Isotr0py/LLaVA-Video-7B-Qwen2-hf)
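
For reference, a minimal sketch of trying the converted repo with vLLM's offline LLM API (the repo id comes from the link above; max_model_len and the sampling settings are only illustrative, and a plain text prompt is used as a smoke test):

    # Minimal sketch: load the converted checkpoint with vLLM and run a text-only prompt.
    # Values like max_model_len, temperature and max_tokens are illustrative, not recommendations.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Isotr0py/LLaVA-Video-7B-Qwen2-hf", max_model_len=8192)
    params = SamplingParams(temperature=0.2, max_tokens=128)
    outputs = llm.generate("Briefly introduce yourself.", params)
    print(outputs[0].outputs[0].text)

Actual video inputs would additionally go through vLLM's multimodal input support, which is omitted here.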

@Noctis-SC (Author) commented

Thank you. I ran this model and it seems to work fine with vLLM. I will test it over the coming days, but for now I have no questions. Thanks again for the quick responses.
