Nomic-embeddings-support #280

Merged: 4 commits merged into qdrant:main on Jul 10, 2024

Conversation

@I8dNLo (Contributor) commented Jun 22, 2024

Fix for the nomic embeddings from #204.

Also moved the jina/miniLM models to the refactored classes PooledEmbedding and PooledNormalizedEmbedding, which implement the embedding logic from the sentence-transformers library for the corresponding models (a sketch of that logic follows the todo list below).

Todo:

  • Move Jina embeddings to the PooledNormalizedEmbedding class
  • Fix the canonical vector for the quantized model nomic-ai/nomic-embed-text-v1.5-Q
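
For readers of this thread: a minimal numpy sketch (not the actual fastembed code) of the two sentence-transformers-style post-processing steps these classes mirror, assuming PooledEmbedding corresponds to mean pooling only and PooledNormalizedEmbedding to mean pooling followed by L2 normalization:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    # Average token vectors while ignoring padding positions ("PooledEmbedding"-style).
    mask = attention_mask[:, :, np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    # Scale each row to unit length; applied on top of mean pooling
    # for "PooledNormalizedEmbedding"-style output.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)
```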

@I8dNLo changed the title from "Draft: Nomic-embeddings-support" to "Nomic-embeddings-support" on Jun 24, 2024
@I8dNLo requested review from joein, generall, and Anush008 on June 24, 2024
@Anush008 (Member) left a comment

Hey.

Do we not want to add aliases with the old class names to avoid breaking changes?

Never mind. Someone would not be accessing those classes directly. Hopefully.

A Member left an inline comment on the diff hunk with the canonical test vectors (the two arrays are the two versions of the vector in the diff):

    "nomic-ai/nomic-embed-text-v1.5": np.array(
        [-1.6531514e-02, 8.5380634e-05, -1.8171231e-01, -3.9333291e-03, 1.2763254e-02]
        [-0.15407836, -0.03053198, -3.9138033, 0.1910364, 0.13224715]
    ),

Were these embeddings obtained differently from nomic-ai/nomic-embed-text-v1?

Why do they have more digits after the decimal point?

@joein merged commit d09af55 into qdrant:main on Jul 10, 2024
15 checks passed
@twellck commented Aug 8, 2024

Hi @I8dNLo,

I'm currently working on adding support for Matryoshka Representation Learning embedding models, the main one being nomic-ai/nomic-embed-text-v1.5.

From the rest of the library, as well as Nomic's documentation, it seems we typically do normalize embeddings.
However, with this PR the embeddings for nomic models went from normalized to non-normalized. While this is not an issue when using cosine similarity, it could have caused issues (or might cause issues if reverted) for people who have been using dot product.
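
A quick numeric illustration of this point, with made-up vectors rather than real model outputs: cosine similarity is unaffected by normalization, while raw dot-product scores are not.

```python
import numpy as np

q = np.array([1.0, 2.0, 2.0])   # un-normalized "query" embedding, norm = 3
d = np.array([2.0, 1.0, 2.0])   # un-normalized "document" embedding, norm = 3

cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))             # ~0.889
dot_raw = q @ d                                                   # 8.0 on raw vectors
dot_norm = (q / np.linalg.norm(q)) @ (d / np.linalg.norm(d))      # ~0.889 on unit vectors

# Cosine gives the same value either way, but dot-product scores (and any
# thresholds tuned on them) shift when a release switches between
# normalized and raw embeddings.
```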

I haven't been able to find the logic from sentence-transformers you are referring to when it comes to normalization; normalization does not appear to be tied to the model in sentence-transformers. Were you mainly referring to the mean_pooling method?

Ultimately, I am able to implement variable dimensionality without normalization (even though it isn't optimal and isn't recommended for nomic-ai/nomic-embed-text-v1.5), but I would like to avoid going down this path and introducing inconsistencies in how different models are handled when it isn't related to the models' requirements.

I'd be happy to hear your opinion, as well as other contributors', on whether to go ahead with my current changes (i.e., merge the normalized-pooled and pooled embedding classes and apply normalization in my new matryoshka embedding classes) or instead keep nomic models' embeddings non-normalized.
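
For concreteness, here is a hypothetical sketch of the truncate-then-renormalize handling I have in mind for the matryoshka classes (function and parameter names are illustrative, not from the codebase):

```python
import numpy as np

def matryoshka_embed(full_embeddings: np.ndarray, dim: int) -> np.ndarray:
    # Keep the first `dim` components of each embedding (Matryoshka models are
    # trained so that prefixes remain usable), then re-normalize, as Nomic
    # recommends for nomic-ai/nomic-embed-text-v1.5.
    truncated = full_embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)
```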
