[Audio] Support Audio Datasets #1085

kylesayrs · 2025-01-20T20:53:16Z

Purpose

Support oneshot with audio datasets

Changes

Extend apply_pad_mask_to_batch to handle batches that have no input_ids and batches that may include decoder_input_ids
Extend TextGenerationDataset to detect whether a dataset is already tokenized based on processor.model_input_names, rather than only checking for input_ids
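
A minimal sketch of the two behaviors described above, not the code from this PR. The helper names looks_tokenized and mask_padding are hypothetical, the processor is a stand-in object rather than a real transformers processor, and batches are plain nested lists instead of tensors. It also assumes attention_mask has the same shape as the ids it masks.

```python
from types import SimpleNamespace


def looks_tokenized(column_names, processor):
    """Hypothetical helper: treat a dataset as already tokenized if any of its
    columns is one of the processor's model inputs."""
    # Assumption: fall back to ["input_ids"] when the attribute is missing
    model_input_names = getattr(processor, "model_input_names", ["input_ids"])
    return any(name in model_input_names for name in column_names)


def mask_padding(batch):
    """Hypothetical helper: zero out padded positions of whichever id fields
    are present, and leave batches without an attention_mask untouched."""
    if "attention_mask" not in batch:
        return batch  # e.g. an audio batch that only has input_features

    masked = dict(batch)  # shallow copy so the caller's batch is not modified
    mask = batch["attention_mask"]
    for key in ("input_ids", "decoder_input_ids"):
        if key in batch:
            masked[key] = [
                [token * keep for token, keep in zip(row, mask_row)]
                for row, mask_row in zip(batch[key], mask)
            ]
    return masked


# Stand-in for a processor whose only model input is input_features
whisper_like = SimpleNamespace(model_input_names=["input_features"])

raw_is_tokenized = looks_tokenized(["audio", "text"], whisper_like)
processed_is_tokenized = looks_tokenized(["input_features", "labels"], whisper_like)

text_batch = mask_padding({"input_ids": [[5, 6, 9]], "attention_mask": [[1, 1, 0]]})
audio_batch = mask_padding({"input_features": [[0.1, 0.2]]})
```

With a check like this, a dataset that only has input_features columns counts as tokenized, and a batch without input_ids passes through masking unchanged.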

Testing

Ran test_processors.py to completion, which verifies that the model_input_names attribute is defined for most processors
Ran whisper to completion in [Audio] Qwen Audio Example #1082

test_processors.py

import pytest
from transformers import AutoProcessor

@pytest.mark.parametrize(
    "model_id,expected",
    [
        ("meta-llama/Meta-Llama-3-8B-Instruct", ["input_ids", "attention_mask"]),
        ("mistralai/Mixtral-8x7B-Instruct-v0.1", ["input_ids", "attention_mask"]),
        (
            "Qwen/Qwen2-VL-2B-Instruct",
            [
                "input_ids",
                "attention_mask",
                "pixel_values",
                "image_grid_thw",
                "pixel_values_videos",
                "video_grid_thw",
            ],
        ),
        ("mgoin/pixtral-12b", ["input_ids", "attention_mask", "pixel_values"]),
        ("openai/whisper-large-v2", ["input_features"]),
        (
            "Qwen/Qwen2-Audio-7B-Instruct",
            ["input_ids", "attention_mask", "input_features", "feature_attention_mask"],
        ),
    ],
)
def test_processor_model_input_names(model_id, expected):
    """
    Tests the model_input_names attribute of common model processors
    """

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    assert processor.model_input_names == expected

Signed-off-by: Kyle Sayers <[email protected]>

github-actions · 2025-01-20T20:53:27Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

src/llmcompressor/transformers/finetune/data/base.py

Signed-off-by: Kyle Sayers <[email protected]>

src/llmcompressor/modifiers/utils/pytorch_helpers.py

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs added 2 commits January 20, 2025 20:46

support audio datasets

2f3a416

Signed-off-by: Kyle Sayers <[email protected]>

mask decoder_input_ids

74283e8

Signed-off-by: Kyle Sayers <[email protected]>

This was referenced Jan 20, 2025

[Audio] Qwen Audio Example #1082

Draft

[Audio] People's Speech dataset and tracer tool #1086

Open

kylesayrs self-assigned this Jan 20, 2025

kylesayrs added the ready label Jan 20, 2025

dsikka reviewed Jan 20, 2025


src/llmcompressor/transformers/finetune/data/base.py (resolved)

add comment

498e598

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs requested a review from dsikka January 22, 2025 14:58

Merge branch 'main' into kylesayrs/audio-datasets

bbf26c6

dsikka previously approved these changes Jan 22, 2025


Merge branch 'main' into kylesayrs/audio-datasets

9aab7b9

horheynm reviewed Jan 22, 2025


src/llmcompressor/modifiers/utils/pytorch_helpers.py (outdated, resolved)

rewrite for clarity

c862c0f

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs dismissed dsikka’s stale review via c862c0f January 22, 2025 21:09

kylesayrs requested review from dsikka and horheynm January 22, 2025 21:10

Merge branch 'main' into kylesayrs/audio-datasets

a476597

dsikka approved these changes Jan 22, 2025


horheynm approved these changes Jan 22, 2025


mgoin approved these changes Jan 22, 2025


mgoin merged commit fb01d66 into main Jan 22, 2025
6 of 7 checks passed

mgoin deleted the kylesayrs/audio-datasets branch January 22, 2025 22:50
