Nomic-embeddings-support #280

Merged: 4 commits merged into qdrant:main on Jul 10, 2024

Conversation

@I8dNLo (Contributor) commented Jun 22, 2024

Fix for the nomic embeddings from #204.

Also moved the jina/miniLM models to the refactored classes PooledEmbedding and PooledNormalizedEmbedding, which implement the embedding logic from the sentence-transformers library for the corresponding models (a sketch of that logic follows the todo list below).

Todo:

  • Move Jina embeddings to the PooledNormalizedEmbedding class
  • Fix the canonical vector for the quantized model nomic-ai/nomic-embed-text-v1.5-Q
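
For readers of this thread: a minimal numpy sketch (not the actual fastembed code) of the two sentence-transformers-style post-processing steps these classes mirror, assuming PooledEmbedding corresponds to mean pooling only and PooledNormalizedEmbedding to mean pooling followed by L2 normalization:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    # Average token vectors while ignoring padding positions ("PooledEmbedding"-style).
    mask = attention_mask[:, :, np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    # Scale each row to unit length; applied on top of mean pooling
    # for "PooledNormalizedEmbedding"-style output.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)
```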

@I8dNLo changed the title from "Draft: Nomic-embeddings-support" to "Nomic-embeddings-support" on Jun 24, 2024
@I8dNLo requested review from joein, generall, and Anush008 on June 24, 2024
@Anush008 (Member) left a comment

Hey.

Do we not want to add aliases with the old class names to avoid breaking changes?

Never mind. Someone would not be accessing those classes directly. Hopefully.

A Member left an inline comment on the diff hunk with the canonical test vectors (the two arrays are the two versions of the vector in the diff):

    "nomic-ai/nomic-embed-text-v1.5": np.array(
        [-1.6531514e-02, 8.5380634e-05, -1.8171231e-01, -3.9333291e-03, 1.2763254e-02]
        [-0.15407836, -0.03053198, -3.9138033, 0.1910364, 0.13224715]
    ),

Were these embeddings obtained differently from nomic-ai/nomic-embed-text-v1?

Why do they have more digits after the decimal point?

@joein merged commit d09af55 into qdrant:main on Jul 10, 2024
15 checks passed
@twellck commented Aug 8, 2024

Hi @I8dNLo,

I'm currently working on adding support for Matryoshka Representation Learning embedding models, the main one being nomic-ai/nomic-embed-text-v1.5.

From the rest of the library, as well as Nomic's documentation, it seems we typically do normalize embeddings.
However, with this PR the embeddings for nomic models went from normalized to non-normalized. While this is not an issue when using cosine similarity, it could have caused issues (or might cause issues if reverted) for people who have been using dot product.
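
A quick numeric illustration of this point, with made-up vectors rather than real model outputs: cosine similarity is unaffected by normalization, while raw dot-product scores are not.

```python
import numpy as np

q = np.array([1.0, 2.0, 2.0])   # un-normalized "query" embedding, norm = 3
d = np.array([2.0, 1.0, 2.0])   # un-normalized "document" embedding, norm = 3

cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))             # ~0.889
dot_raw = q @ d                                                   # 8.0 on raw vectors
dot_norm = (q / np.linalg.norm(q)) @ (d / np.linalg.norm(d))      # ~0.889 on unit vectors

# Cosine gives the same value either way, but dot-product scores (and any
# thresholds tuned on them) shift when a release switches between
# normalized and raw embeddings.
```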

I haven't been able to find the logic from sentence-transformers you are referring to when it comes to normalization; normalization does not appear to be tied to the model in sentence-transformers. Were you mainly referring to the mean_pooling method?

Ultimately, I am able to implement variable dimensionality without normalization (even though it isn't optimal and isn't recommended for nomic-ai/nomic-embed-text-v1.5), but I would like to avoid going down this path and introducing inconsistencies in how different models are handled when it isn't related to the models' requirements.

I'd be happy to hear your opinion, as well as other contributors', on whether to go ahead with my current changes (i.e., merge the normalized-pooled and pooled embedding classes and apply normalization in my new matryoshka embedding classes) or instead keep nomic models' embeddings non-normalized.
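
For concreteness, here is a hypothetical sketch of the truncate-then-renormalize handling I have in mind for the matryoshka classes (function and parameter names are illustrative, not from the codebase):

```python
import numpy as np

def matryoshka_embed(full_embeddings: np.ndarray, dim: int) -> np.ndarray:
    # Keep the first `dim` components of each embedding (Matryoshka models are
    # trained so that prefixes remain usable), then re-normalize, as Nomic
    # recommends for nomic-ai/nomic-embed-text-v1.5.
    truncated = full_embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)
```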
