
Colbert #248

Merged · 6 commits merged into main from colbert on May 31, 2024
Conversation

@joein (Member) commented on May 24, 2024

This PR adds classes for late interaction embeddings.
The only model supported at the moment is colbertv2.0.

A few complexities in this PR stem from the way ColBERT processes documents and queries.
It inserts a special marker token, [Q] or [D], as the second token in the sequence.
These markers have to be added as token IDs so they are processed correctly.
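The marker insertion can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation; the token ID values below (101 for [CLS], 1 for the [Q] marker) are placeholders, not colbertv2.0's real vocabulary IDs.

```python
def insert_marker(token_ids: list[int], marker_id: int) -> list[int]:
    """Place the [Q]/[D] marker ID right after the first ([CLS]) token.

    Inserting the raw marker string into the text instead would risk the
    tokenizer splitting it into subwords, so the ID is spliced in directly.
    """
    return [token_ids[0], marker_id] + token_ids[1:]

# Placeholder IDs: [CLS] what is [SEP]
query_ids = [101, 2054, 2003, 102]
print(insert_marker(query_ids, 1))  # [101, 1, 2054, 2003, 102]
```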

Another complexity is that the ColBERT authors recommend padding queries with [MASK] tokens (query augmentation), which improves retrieval quality. Padding extends the query to 32 tokens.
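The padding step can be sketched like this. Again a hypothetical sketch: the `mask_id` value 103 is a placeholder, and short queries are simply extended to the fixed length.

```python
QUERY_LEN = 32  # ColBERT's fixed query length for augmentation

def pad_query(token_ids: list[int], mask_id: int) -> list[int]:
    """Extend a tokenized query to QUERY_LEN tokens using [MASK] IDs."""
    padding = [mask_id] * max(0, QUERY_LEN - len(token_ids))
    return token_ids + padding

padded = pad_query([101, 1, 2054, 2003, 102], mask_id=103)
print(len(padded))  # 32
```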

The official ColBERT library allocates 32 tokens for the query plus (512 - 32 - 1) tokens for the context.
(link to query tokenizer)

We don't make such a separation; instead, we simply limit the query to 512 tokens.

ColBERT also requires removing punctuation tokens. This is done after the inference stage by zeroing out the corresponding positions in the attention mask and then multiplying the model's output by the updated mask.
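The punctuation-filtering step can be sketched as below. This is an assumed illustration, not the PR's code: it zeroes the mask at punctuation positions and broadcasts the mask over the embedding dimension, so filtered tokens contribute nothing to late-interaction scoring.

```python
import string

import numpy as np

def filter_punctuation(
    embeddings: np.ndarray,      # (seq_len, dim) per-token output embeddings
    attention_mask: np.ndarray,  # (seq_len,) 0/1 attention mask
    tokens: list[str],           # decoded tokens aligned with the mask
) -> np.ndarray:
    """Zero out punctuation positions in the mask, then apply it."""
    mask = attention_mask.copy()
    for i, tok in enumerate(tokens):
        if tok in string.punctuation:
            mask[i] = 0
    # Broadcast the (seq_len,) mask over the embedding dimension.
    return embeddings * mask[:, None]

emb = np.ones((4, 2))
out = filter_punctuation(emb, np.ones(4, dtype=int), ["[CLS]", "hi", ",", "[SEP]"])
print(out[2])  # the "," row is zeroed
```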

@joein joein requested review from generall, NirantK and I8dNLo May 24, 2024 16:28
@joein joein merged commit c8fff66 into main May 31, 2024
17 checks passed
@joein joein deleted the colbert branch May 31, 2024 15:06