
Colbert #248

Merged · 6 commits merged into main from colbert on May 31, 2024
Conversation

@joein (Member) commented on May 24, 2024

This PR adds classes for late interaction embeddings.
The only model supported at the moment is colbertv2.0.

A few complexities in this PR stem from the way ColBERT processes documents and queries.
It inserts a special marker token, [Q] or [D], as the second token in the sequence.
These markers have to be added as token IDs so they are processed correctly.
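The marker insertion can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation; the token ID values below (101 for [CLS], 1 for the [Q] marker) are placeholders, not colbertv2.0's real vocabulary IDs.

```python
def insert_marker(token_ids: list[int], marker_id: int) -> list[int]:
    """Place the [Q]/[D] marker ID right after the first ([CLS]) token.

    Inserting the raw marker string into the text instead would risk the
    tokenizer splitting it into subwords, so the ID is spliced in directly.
    """
    return [token_ids[0], marker_id] + token_ids[1:]

# Placeholder IDs: [CLS] what is [SEP]
query_ids = [101, 2054, 2003, 102]
print(insert_marker(query_ids, 1))  # [101, 1, 2054, 2003, 102]
```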

Another complexity is that the ColBERT authors recommend padding queries with [MASK] tokens (query augmentation), which improves retrieval quality. Padding extends the query to 32 tokens.
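The padding step can be sketched like this. Again a hypothetical sketch: the `mask_id` value 103 is a placeholder, and short queries are simply extended to the fixed length.

```python
QUERY_LEN = 32  # ColBERT's fixed query length for augmentation

def pad_query(token_ids: list[int], mask_id: int) -> list[int]:
    """Extend a tokenized query to QUERY_LEN tokens using [MASK] IDs."""
    padding = [mask_id] * max(0, QUERY_LEN - len(token_ids))
    return token_ids + padding

padded = pad_query([101, 1, 2054, 2003, 102], mask_id=103)
print(len(padded))  # 32
```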

The official ColBERT library allocates 32 tokens for the query plus (512 - 32 - 1) tokens for the context.
(link to query tokenizer)

We don't make such a separation; instead, we simply limit the query to 512 tokens.

ColBERT also requires removing punctuation tokens. This is done after the inference stage by zeroing out the corresponding positions in the attention mask and then multiplying the model's output by the updated mask.
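The punctuation-filtering step can be sketched as below. This is an assumed illustration, not the PR's code: it zeroes the mask at punctuation positions and broadcasts the mask over the embedding dimension, so filtered tokens contribute nothing to late-interaction scoring.

```python
import string

import numpy as np

def filter_punctuation(
    embeddings: np.ndarray,      # (seq_len, dim) per-token output embeddings
    attention_mask: np.ndarray,  # (seq_len,) 0/1 attention mask
    tokens: list[str],           # decoded tokens aligned with the mask
) -> np.ndarray:
    """Zero out punctuation positions in the mask, then apply it."""
    mask = attention_mask.copy()
    for i, tok in enumerate(tokens):
        if tok in string.punctuation:
            mask[i] = 0
    # Broadcast the (seq_len,) mask over the embedding dimension.
    return embeddings * mask[:, None]

emb = np.ones((4, 2))
out = filter_punctuation(emb, np.ones(4, dtype=int), ["[CLS]", "hi", ",", "[SEP]"])
print(out[2])  # the "," row is zeroed
```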

@joein joein requested review from generall, NirantK and I8dNLo May 24, 2024 16:28
@joein joein merged commit c8fff66 into main May 31, 2024
17 checks passed
@joein joein deleted the colbert branch May 31, 2024 15:06