[Audio] Support Audio Datasets #1085

Merged
merged 7 commits into from
Jan 22, 2025

kylesayrs
Collaborator

Purpose

  • Support oneshot with audio datasets

Changes

  • Extend apply_pad_mask_to_batch to handle cases where there are no input_ids and where there might be decoder_input_ids
  • Extend TextGenerationDataset to detect if a dataset is already tokenized based on processor.model_input_names rather than only input_ids
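
The two changes above can be illustrated with a minimal, dependency-free sketch. It is not the code from this PR: the helper names `looks_tokenized` and `mask_padding` are hypothetical, batches are plain nested lists rather than tensors, and pairing `decoder_input_ids` with a `decoder_attention_mask` field is an assumption made only for this illustration.

```python
from typing import Dict, Iterable

# Assumption for this sketch: each ids field is masked by its own mask field.
# The actual pairing used by apply_pad_mask_to_batch is not shown in this PR text.
IDS_TO_MASK = {
    "input_ids": "attention_mask",
    "decoder_input_ids": "decoder_attention_mask",
}


def looks_tokenized(column_names: Iterable[str], model_input_names: Iterable[str]) -> bool:
    """Hypothetical helper: treat a dataset as already processed if any of its
    columns is one of the processor's model_input_names (not only input_ids)."""
    return bool(set(column_names) & set(model_input_names))


def mask_padding(batch: Dict[str, list]) -> Dict[str, list]:
    """Hypothetical helper: zero out padded positions in whichever ids fields
    are present, skipping fields (such as input_ids) that the batch lacks."""
    out = dict(batch)
    for ids_key, mask_key in IDS_TO_MASK.items():
        if ids_key in batch and mask_key in batch:
            out[ids_key] = [
                [token * keep for token, keep in zip(ids_row, mask_row)]
                for ids_row, mask_row in zip(batch[ids_key], batch[mask_key])
            ]
    return out


# An audio-style batch with no input_ids at all
audio_batch = {
    "input_features": [[0.1, 0.2]],
    "decoder_input_ids": [[5, 6, 7]],
    "decoder_attention_mask": [[1, 1, 0]],
}
masked = mask_padding(audio_batch)
```

With a Whisper-style processor whose `model_input_names` is `["input_features"]`, a dataset that has an `input_features` column would be detected as already processed, while one with only raw `audio` and `text` columns would not.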

Testing

  • Ran test_processors.py to completion, which verifies that the model_input_names attribute is defined for most processors
  • Ran Whisper to completion in [Audio] Qwen Audio Example #1082
test_processors.py
import pytest
from transformers import AutoProcessor

@pytest.mark.parametrize(
    "model_id,expected",
    [
        ("meta-llama/Meta-Llama-3-8B-Instruct", ["input_ids", "attention_mask"]),
        ("mistralai/Mixtral-8x7B-Instruct-v0.1", ["input_ids", "attention_mask"]),
        (
            "Qwen/Qwen2-VL-2B-Instruct",
            [
                "input_ids",
                "attention_mask",
                "pixel_values",
                "image_grid_thw",
                "pixel_values_videos",
                "video_grid_thw",
            ],
        ),
        ("mgoin/pixtral-12b", ["input_ids", "attention_mask", "pixel_values"]),
        ("openai/whisper-large-v2", ["input_features"]),
        (
            "Qwen/Qwen2-Audio-7B-Instruct",
            ["input_ids", "attention_mask", "input_features", "feature_attention_mask"],
        ),
    ],
)
def test_processor_model_input_names(model_id, expected):
    """
    Tests the model_input_names attribute of common model processors
    """

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    assert processor.model_input_names == expected

Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite. Please add the label only once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs self-assigned this Jan 20, 2025
@kylesayrs kylesayrs added the ready When a PR is ready for review label Jan 20, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs requested a review from dsikka January 22, 2025 14:58
dsikka
dsikka previously approved these changes Jan 22, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@mgoin mgoin merged commit fb01d66 into main Jan 22, 2025
6 of 7 checks passed
@mgoin mgoin deleted the kylesayrs/audio-datasets branch January 22, 2025 22:50