Large references #305

Closed
4 of 7 tasks
marcelm opened this issue Jun 16, 2023 · 2 comments

Comments

@marcelm
Collaborator

marcelm commented Jun 16, 2023

Here is my to-do list for making strobealign work with references with more than $2^{32}$ strobemers. This will resolve #277 and #285.

  • Counters (unique strobemers, total no. of strobemers etc.) need to be bumped from unsigned to uint64_t. This can be done unconditionally as it’s not a memory or performance issue when these are just always 64 bit.
  • Audit types: search for int and unsigned and check whether they need to be replaced with bucket_index_t, size_t, or uint64_t
  • Switch the bucket_index_t typedef to uint64_t
  • Test whether this modified version works with large references.
  • Extract the hash table part of StrobemerIndex into a separate class (let’s call it Hashtable here)
  • Make bucket_index_t a template parameter of that class
  • Decide at runtime whether to use a Hashtable<uint32_t> or Hashtable<uint64_t> (use a virtual function or a function pointer).
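
The last three items could look roughly like the sketch below. It is only an illustration, not strobealign's actual code: `HashtableBase`, `make_hashtable`, and all member names are hypothetical, and the real `StrobemerIndex` holds more state than a vector of bucket starts. The sketch uses the virtual-function option to hide the index width from callers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

// Hypothetical interface: callers see 64-bit values regardless of storage width.
class HashtableBase {
public:
    virtual ~HashtableBase() = default;
    virtual void set_bucket_start(std::size_t bucket, std::uint64_t start) = 0;
    virtual std::uint64_t bucket_start(std::size_t bucket) const = 0;
    virtual std::size_t bytes_per_entry() const = 0;
};

// bucket_index_t is a template parameter instead of a global typedef.
template <typename bucket_index_t>
class Hashtable : public HashtableBase {
public:
    explicit Hashtable(std::size_t n_buckets) : starts_(n_buckets, 0) {}

    void set_bucket_start(std::size_t bucket, std::uint64_t start) override {
        // Guard against silently truncating an index that needs more bits.
        assert(start <= std::numeric_limits<bucket_index_t>::max());
        starts_.at(bucket) = static_cast<bucket_index_t>(start);
    }

    std::uint64_t bucket_start(std::size_t bucket) const override {
        return starts_.at(bucket);
    }

    std::size_t bytes_per_entry() const override {
        return sizeof(bucket_index_t);
    }

private:
    std::vector<bucket_index_t> starts_;
};

// Pick the narrowest index type that can still address every strobemer.
// A bucket start may equal n_strobemers (end sentinel), hence "<=".
std::unique_ptr<HashtableBase> make_hashtable(std::size_t n_buckets,
                                              std::uint64_t n_strobemers) {
    if (n_strobemers <= std::numeric_limits<std::uint32_t>::max()) {
        return std::make_unique<Hashtable<std::uint32_t>>(n_buckets);
    }
    return std::make_unique<Hashtable<std::uint64_t>>(n_buckets);
}
```

The virtual call adds an indirection on every lookup, which is one reason a function pointer or a templated caller might be preferred in a hot path.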

Much of the above is already done. The question for me is whether we want to release a strobealign version now that unconditionally uses 64-bit bucket start indices, because switching to deciding the index size dynamically looks like a fair bit of work.

For CHM13, the index vector currently needs 1 GiB ($2^{28}$ entries times 4 bytes per entry). That size would double with 64-bit indices, so 2 GiB. Overall memory usage would increase from 13 GiB to 14 GiB. But then, memory usage was 22 GiB before merging #278, so the savings are still huge.
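
The index vector arithmetic can be checked at compile time with a small sketch. The helper `index_vector_bytes` and the constant `GiB` are made up for this illustration; only the $2^{28}$ entry count and the 4-byte and 8-byte entry sizes come from the numbers above.

```cpp
#include <cstddef>
#include <cstdint>

// Size in bytes of a bucket-start vector with 2^bits entries.
constexpr std::uint64_t index_vector_bytes(unsigned bits,
                                           std::size_t bytes_per_entry) {
    return (std::uint64_t{1} << bits) * bytes_per_entry;
}

constexpr std::uint64_t GiB = std::uint64_t{1} << 30;

// 2^28 entries * 4 bytes = 2^30 bytes = 1 GiB with 32-bit indices.
static_assert(index_vector_bytes(28, sizeof(std::uint32_t)) == 1 * GiB,
              "32-bit index vector should be 1 GiB");

// 2^28 entries * 8 bytes = 2^31 bytes = 2 GiB with 64-bit indices.
static_assert(index_vector_bytes(28, sizeof(std::uint64_t)) == 2 * GiB,
              "64-bit index vector should be 2 GiB");
```

The 1 GiB difference between the two matches the stated increase in overall memory usage from 13 GiB to 14 GiB.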

@ksahlin
Owner

ksahlin commented Jun 16, 2023

I think it's a great idea to unconditionally use 64-bit bucket start indices for now and make a release so that #277 and #285 get resolved. Let's do that!

As you say, going from 1 GB to 2 GB in human (or 2 GB to 4 GB in rye) is negligible compared to the memory used by the flat vector.

@marcelm
Collaborator Author

marcelm commented Sep 8, 2023

When we met yesterday, we decided to close this issue for now, as we think it is quite some work to dynamically switch between 32-bit and 64-bit indices. Always using 64 bits is good enough. Also, we are considering some further memory reductions, which are no longer possible if the index vector only uses 32 bits.

marcelm closed this as not planned on Sep 8, 2023