Make strobealign work with large references #306
As I just commented in #305, I think it is a great idea until templates are implemented. |
This is printed before the index is actually created, so one way to use this is to let strobealign run until this point and then hit Ctrl+C if the reported estimate is too high.
This doubles memory usage for the randstrobe start vector, but makes strobealign work with large references (those with more than 2**32 strobemers).
During testing, I found a couple of other types that needed fixing, but it works now! I used the 100 bp single-end rye dataset for testing. With the default settings, the number of strobemers is below 2 billion and we get a mapping rate of 99.2582% and accuracy of 69.2576%. To test this PR, I used the same dataset, but with options I think this is a good indication that there’s nothing fundamentally wrong with the PR, so I’d say we should merge it and make a release (next week). |
Sounds great. I will also test a version on the standard benchmark that I typically run before a release. Just let me know a commit, and I can start it whenever we have something we believe is release-ready. |
Do you want to run the test before or after merging? |
I could test after merging. |
I can also compare to a version using XXHASH for s-mers if you want. But I don't see it in the commit network, maybe you didn’t push it yet? |
Ok, I’ll merge this now and then push a branch with the XXH changes. |
The syncmer XXH change is in branch |
I have benchmarked current develop (commit e0764b6, name v0110) and the XXH change (commit d8dc256, name XXH). See attached plots. Main points:
memory_plot.pdf |
I forgot to mention the indexing speed of the two latest versions (see below for 150 nt reads, sorry for the formatting). In summary, both offer a significant speedup over previous versions. XXH is 7-15% slower.
|
I usually don’t report the Jaccard-based numbers at the moment because I want the numbers to be comparable to what your evaluation script computes. I think it was important for the soft-clipping evaluation to use them, but for many other algorithm changes, it doesn’t matter that much. The difference is probably because I reported numbers for single-end read mapping. I usually test on single reads because it is faster and because I want to see the effect without mate-based rescue. That exaggerates differences a bit. |
I see, makes sense, thanks! Still worth changing to XXH-produced s-mers, in my opinion. |
(Marking as draft because this is untested.)
This unconditionally switches the bucket start indices from 32 bit to 64 bit. This makes strobealign work with more than $2^{32}$ strobemers, but also doubles the memory usage of the index vector.
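The trade-off described above can be illustrated with a small Python sketch. It is not strobealign's code: the function names and the bucket count are hypothetical, chosen only to show that widening each start index from 32 to 64 bits doubles the vector's size, and that an offset above $2^{32}$ does not survive being stored in a 32-bit unsigned field.

```python
import ctypes


def start_vector_bytes(num_buckets: int, bits: int) -> int:
    """Bytes needed for a vector of num_buckets start indices of the given width."""
    return num_buckets * (bits // 8)


def store_as_uint32(offset: int) -> int:
    """Value that remains when offset is stored in a 32-bit unsigned field."""
    return ctypes.c_uint32(offset).value


def store_as_uint64(offset: int) -> int:
    """Value that remains when offset is stored in a 64-bit unsigned field."""
    return ctypes.c_uint64(offset).value


# Hypothetical bucket count, chosen only to give round numbers.
num_buckets = 2**28
print(start_vector_bytes(num_buckets, 32) / 2**30, "GiB")  # 1.0 GiB
print(start_vector_bytes(num_buckets, 64) / 2**30, "GiB")  # 2.0 GiB

# An offset just past 2**32 wraps around in 32 bits but is preserved in 64 bits.
offset = 2**32 + 5
print(store_as_uint32(offset))  # 5
print(store_as_uint64(offset))  # 4294967301
```

The wrapped value in the 32-bit case would silently point at the wrong position, which is why the wider type is needed once the number of strobemers can exceed $2^{32}$.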
Closes #277
Closes #285