CLIPScore with different multimodal models for longer captions #2906

Open
arijit-hub opened this issue Jan 15, 2025 · 1 comment · May be fixed by #2978
@arijit-hub

🚀 Feature

CLIPScore is known not to work well for long captions. For that reason, torchmetrics truncates the text tokens to 77 so that the metric still gives valid results. However, it would be nice to have a metric that can actually score long captions. I would like to propose using Jina CLIP v2 for long-caption CLIP score calculation. We could either add it on top of the existing CLIPScore metric (it is very similar; we would accept a Jina string in model_name_or_path) or set it up as its own metric, as below.

import torch
from torchmetrics import Metric
from torchvision import transforms


class CLIPJinaScore(Metric):
    """Implements the CLIPScore using the Jina-CLIP-v2 model."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model, self.processor = self._get_jina_model_and_processor()
        self.add_state("score", torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state(
            "n_samples", torch.tensor(0, dtype=torch.long), dist_reduce_fx="sum"
        )

    def update(self, images, text):
        """Update score on a batch of images and text."""
        score, n_samples = self._score_update(images, text, self.model, self.processor)
        self.score += score.sum(0)
        self.n_samples += n_samples

    def compute(self):
        """Compute accumulated score."""
        return torch.max(self.score / self.n_samples, torch.zeros_like(self.score))

    def _get_jina_model_and_processor(self):
        """Returns the Jina-CLIP-v2 model and processor."""
        from transformers import AutoModel, AutoProcessor

        model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

        processor = AutoProcessor.from_pretrained(
            "jinaai/jina-clip-v2", trust_remote_code=True
        )

        return model, processor

    def _score_update(self, images, text, model, processor):
        """Update score on a batch of images and text."""

        device = images[0].device

        processed_input = processor(
            text=text,
            # the processor only accepts PIL images
            images=[transforms.functional.to_pil_image(i.cpu()) for i in images],
            return_tensors="pt",
            padding=True,
        )

        img_features = model.get_image_features(
            processed_input["pixel_values"].to(device)
        )
        img_features = img_features / img_features.norm(p=2, dim=-1, keepdim=True)

        txt_features = model.get_text_features(
            processed_input["input_ids"].to(device),
            processed_input["attention_mask"].to(device),
        )
        txt_features = txt_features / txt_features.norm(p=2, dim=-1, keepdim=True)

        # cosine similarity between feature vectors
        score = 100 * (img_features * txt_features).sum(axis=-1)
        return score, len(text)
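
To make the arithmetic in update and compute concrete, here is a minimal, dependency-free sketch in which plain Python lists stand in for the model embeddings. The names cosine_score and RunningClipScore are hypothetical and are not part of torchmetrics or Jina CLIP; the sketch only mirrors the scoring logic above (100 times the cosine similarity per pair, averaged, then clamped at zero).

```python
import math


def cosine_score(img_vec, txt_vec):
    """Return 100 * cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(img_vec, txt_vec))
    img_norm = math.sqrt(sum(a * a for a in img_vec))
    txt_norm = math.sqrt(sum(b * b for b in txt_vec))
    return 100.0 * dot / (img_norm * txt_norm)


class RunningClipScore:
    """Toy accumulator mirroring the update/compute split of the metric above."""

    def __init__(self):
        self.score = 0.0    # running sum of per-pair scores
        self.n_samples = 0  # number of image/caption pairs seen

    def update(self, img_vecs, txt_vecs):
        # Pair each image embedding with the caption embedding at the same index.
        for img_vec, txt_vec in zip(img_vecs, txt_vecs):
            self.score += cosine_score(img_vec, txt_vec)
        self.n_samples += len(txt_vecs)

    def compute(self):
        # The mean score is clamped at zero, as in the compute method above.
        return max(self.score / self.n_samples, 0.0)
```

Because the state is just a running sum and a sample count, both can be reduced with a sum across processes, which is why the proposal registers them with dist_reduce_fx="sum".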

Let me know what you think.

cc @Borda @lantiga @awaelchli

@arijit-hub arijit-hub added the enhancement New feature or request label Jan 15, 2025

Hi! Thanks for your contribution! Great first issue!

@SkafteNicki SkafteNicki linked a pull request Feb 27, 2025 that will close this issue
4 tasks
@SkafteNicki SkafteNicki added this to the v1.6.0 milestone Feb 28, 2025
@SkafteNicki SkafteNicki self-assigned this Feb 28, 2025
@SkafteNicki SkafteNicki modified the milestones: v1.6.0, v1.7.0 Feb 28, 2025