The official PyTorch implementation of "Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification".
Make sure your workspace path matches the paths used in the code, and then:
conda create -n dynamic_llava python=3.10 -y
conda activate dynamic_llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
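After installation, a quick sanity check can confirm the environment is usable (a minimal sketch; it only assumes the editable install above succeeded and that flash-attn built against your local CUDA toolkit):

```bash
# Check that the llava package, flash-attn, and CUDA-enabled PyTorch all import correctly.
python -c "import llava, flash_attn, torch; print('CUDA available:', torch.cuda.is_available())"
```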
Dynamic-LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
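For example, the default scripts below use per_device_train_batch_size 8 and gradient_accumulation_steps 1 on 8 GPUs, i.e. a global batch size of 64. A quick way to check an alternative configuration (an illustrative sketch, not part of the training scripts):

```bash
# Edit the three values to match your setup; the product should stay at 64.
PER_DEVICE_BS=8; GRAD_ACCUM=2; NUM_GPUS=4
echo "global batch size = $((PER_DEVICE_BS * GRAD_ACCUM * NUM_GPUS))"   # 8 x 2 x 4 = 64
```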
Please download the annotation of the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituting datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as .jpg (a conversion sketch follows this list)
- TextVQA: train_val_images
- VisualGenome: part1, part2
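For OCR-VQA, the downloaded images arrive in mixed formats (e.g., .png and .gif); a minimal conversion sketch, assuming ImageMagick is installed and the images sit in ./playground/data/ocr_vqa/images, is:

```bash
cd ./playground/data/ocr_vqa/images
# Convert each non-JPEG image to a .jpg copy alongside the original (requires ImageMagick).
find . -maxdepth 1 -type f \( -iname '*.png' -o -iname '*.gif' \) -exec mogrify -format jpg {} +
```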
After downloading all of them, organize the data as follows in ./playground/data:
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
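A minimal sketch to pre-create this layout before moving the downloaded images into place (directory names mirror the tree above):

```bash
# Create the expected folder structure under ./playground/data.
mkdir -p ./playground/data/{coco/train2017,gqa/images,ocr_vqa/images,textvqa/train_images,vg/VG_100K,vg/VG_100K_2}
```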
To train Dynamic-LLaVA, you can get the base checkpoints from [LLaVA-1.5-7B] and [LLaVA-1.5-13B] for training Dynamic-LLaVA-7B and Dynamic-LLaVA-13B, respectively.
We provide the training scripts for Dynamic-LLaVA-7B and Dynamic-LLaVA-13B, which you can find in run/.
To train Dynamic-LLaVA-7B, you can directly run the shell script run/train_dynamic_llava_7b.sh; the detailed command is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/train_dynamic_llava_7b.sh
The details of run/train_dynamic_llava_7b.sh are as follows:
#!/bin/bash
deepspeed llava/train/train_sparse.py \
--deepspeed ./scripts/zero3.json \
# your open-source checkpoint path
--model_name_or_path [llava-v1.5-7b] \
--version v1 \
# your instruction-following annotation file
--data_path [./playground/data/llava_v1_5_mix665k.json] \
# the root folder of your training images
--image_folder [./playground/data] \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length False \
--requires_image True \
--bf16 True \
--output_dir ./results/dynamic-llava-7b \
--num_train_epochs 1.0 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 40000 \
--save_total_limit 1 \
--learning_rate 5e-6 \
--weight_decay 0. \
--predictor_lr 2e-4 \
--predictor_weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb \
--mask_loss_weight 100.0 \
--gumbel_start_tau 1.0 \
--gumbel_end_tau 0.1 \
--use_vision_predictor True \
--use_text_predictor True \
--use_output_text_predictor True \
--use_instruct_predictor False \
--vision_keep_rate 0.2 \
--output_text_keep_rate 0.5 \
--output_text_len_for_training 50
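The last block of flags configures the dynamic sparsification. Judging by their names, vision_keep_rate and output_text_keep_rate set the fraction of vision tokens and generated-text tokens that are kept after sparsification. A hypothetical milder budget (illustrative values only, not the configuration used above) would change just these two lines in the script:

```bash
    --vision_keep_rate 0.5 \       # illustrative: keep ~50% of vision tokens instead of ~20%
    --output_text_keep_rate 0.7 \  # illustrative: keep ~70% of output text tokens
```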
To train Dynamic-LLaVA-13B, you can directly run the shell script run/train_dynamic_llava_13b.sh; the detailed command is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/train_dynamic_llava_13b.sh
The details of run/train_dynamic_llava_13b.sh are as follows:
#!/bin/bash
deepspeed llava/train/train_sparse.py \
--deepspeed ./scripts/zero3.json \
# your open-source checkpoint path
--model_name_or_path [llava-v1.5-13b] \
--version v1 \
# your instruction-following annotation file
--data_path [./playground/data/llava_v1_5_mix665k.json] \
# the root folder of your training images
--image_folder [./playground/data] \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length False \
--requires_image True \
--bf16 True \
--output_dir ./results/dynamic-llava-13b \
--num_train_epochs 1.0 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 40000 \
--save_total_limit 1 \
--learning_rate 5e-6 \
--weight_decay 0. \
--predictor_lr 2e-4 \
--predictor_weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb \
--mask_loss_weight 100.0 \
--gumbel_start_tau 1.0 \
--gumbel_end_tau 0.1 \
--use_vision_predictor True \
--use_text_predictor True \
--use_output_text_predictor True \
--use_instruct_predictor False \
--vision_keep_rate 0.2 \
--output_text_keep_rate 0.5 \
--output_text_len_for_training 50
We provide evaluation scripts for the benchmarks.
To evaluate Dynamic-LLaVA-7B on the VQAv2 benchmark, you can directly run the shell script run/dynamic_eval/eval_for_vqav2.sh; the detailed command is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/dynamic_eval/eval_for_vqav2.sh
The details of run/dynamic_eval/eval_for_vqav2.sh are as follows:
#!/bin/bash
gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
IFS=',' read -ra GPULIST <<< "$gpu_list"
CHUNKS=${#GPULIST[@]}
CKPT="dynamic-llava-7b"
SPLIT="llava_vqav2_mscoco_test-dev2015"
for IDX in $(seq 0 $((CHUNKS-1))); do
CUDA_VISIBLE_DEVICES=${GPULIST[$IDX]} python -m llava.dynamic_eval.model_vqa_loader \
# your Dynamic-LLaVA checkpoint path
--model-path [./results/dynamic-llava-7b] \
# your benchmark question file
--question-file [./playground/data/eval/vqav2/$SPLIT.jsonl] \
# your benchmark image folder
--image-folder [./playground/data/eval/vqav2/test2015] \
--answers-file ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl \
--num-chunks $CHUNKS \
--chunk-idx $IDX \
--temperature 0 \
--conv-mode vicuna_v1 &
done
wait
output_file=./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/merge.jsonl
# Clear out the output file if it exists.
> "$output_file"
# Loop through the indices and concatenate each file.
for IDX in $(seq 0 $((CHUNKS-1))); do
cat ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done
python scripts/convert_vqav2_for_submission.py --split $SPLIT --ckpt $CKPT \
--test_dir "./playground/data/eval/vqav2" \
--result_dir "./playground/data/eval/vqav2"
Then, submit the results under ./playground/data/eval/vqav2/answers_upload to the evaluation server.
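A quick sanity check before uploading (a simple sketch, reusing the $SPLIT and $CKPT values from the script): the number of merged predictions should match the number of questions in the split file.

```bash
# Both counts should be identical before you package the submission.
wc -l < ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/merge.jsonl
wc -l < ./playground/data/eval/vqav2/$SPLIT.jsonl
```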
To evaluate Dynamic-LLaVA-7B on the GQA benchmark, you can directly run the shell script run/dynamic_eval/eval_for_gqa.sh; the detailed command is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/dynamic_eval/eval_for_gqa.sh
The details of run/dynamic_eval/eval_for_gqa.sh are as follows:
#!/bin/bash
gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
IFS=',' read -ra GPULIST <<< "$gpu_list"
CHUNKS=${#GPULIST[@]}
CKPT="dynamic-llava-7b"
SPLIT="llava_gqa_testdev_balanced"
GQADIR="[./playground/data/eval/gqa/data]" # your benchmark path
for IDX in $(seq 0 $((CHUNKS-1))); do
CUDA_VISIBLE_DEVICES=${GPULIST[$IDX]} python -m llava.dynamic_eval.model_vqa_loader \
# your Dynamic-LLaVA checkpoint path
--model-path [./results/dynamic-llava-7b] \
# your benchmark question file
--question-file [./playground/data/eval/gqa/$SPLIT.jsonl] \
# your benchmark image folder
--image-folder [./playground/data/eval/gqa/data/images] \
--answers-file ./playground/data/eval/gqa/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl \
--num-chunks $CHUNKS \
--chunk-idx $IDX \
--temperature 0 \
--conv-mode vicuna_v1 &
done
wait
output_file=./playground/data/eval/gqa/answers/$SPLIT/$CKPT/merge.jsonl
# Clear out the output file if it exists.
> "$output_file"
# Loop through the indices and concatenate each file.
for IDX in $(seq 0 $((CHUNKS-1))); do
cat ./playground/data/eval/gqa/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done
python scripts/convert_gqa_for_eval.py --src $output_file --dst $GQADIR/testdev_balanced_predictions.json
cd $GQADIR
python eval/eval.py --tier testdev_balanced
This project is based on LLaVA. Thanks for their wonderful work.