Make strobealign work with large references #306

marcelm · 2023-06-16T08:03:14Z

(Marking as draft because this is untested.)

This unconditionally switches the bucket start indices from 32 to 64 bit. This makes strobealign work with more than $2^{32}$ strobemers, but also doubles memory usage of the index vector.
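To illustrate the trade-off described above, here is a minimal Python sketch; it is not strobealign's actual C++ code. The function names, the strobemer count, and the bucket count are hypothetical. It only shows that an offset above 2**32 wraps around when stored in 32 bits, and that widening each entry to 64 bits doubles the size of the vector.

```python
def store_offset(offset: int, bits: int) -> int:
    """Emulate storing a non-negative offset in an unsigned integer of the given width."""
    return offset & ((1 << bits) - 1)


def start_vector_bytes(n_buckets: int, bits: int) -> int:
    """Size in bytes of a vector that holds one start index per bucket."""
    return n_buckets * (bits // 8)


n_strobemers = 7_000_000_000  # illustrative count above 2**32
last_offset = n_strobemers - 1

# A 32-bit entry silently wraps around; a 64-bit entry keeps the value.
wrapped = store_offset(last_offset, 32)
kept = store_offset(last_offset, 64)

n_buckets = 1 << 28  # hypothetical number of buckets, for illustration only
ratio = start_vector_bytes(n_buckets, 64) / start_vector_bytes(n_buckets, 32)

print(f"32-bit entry stores {wrapped}, 64-bit entry stores {kept}")
print(f"64-bit vector is {ratio:.0f}x the size of the 32-bit vector")
```

The wrapped value still looks like a valid index, which is why this kind of overflow tends to show up as wrong results rather than as a crash.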

Closes #277
Closes #285

ksahlin · 2023-06-16T08:37:53Z

As I just commented in #305, I think it is a great idea until templates are implemented.


marcelm · 2023-06-16T13:58:26Z

During testing, I found a couple of other types that needed fixing, but it works now!

I used the 100 bp single-end rye dataset for testing.

With the default settings, the no. of strobemers is below 2 billion and we get a mapping rate of 99.2582% and accuracy of 69.2576%. To test this PR, I used the same dataset, but with options -k 22 -s 22, which results in over 7 billion strobemers and needs 140 GB of RAM. The mapping rate was 99.2639% and accuracy 70.1665%.

I think this is a good indication that there’s nothing fundamentally wrong with the PR, so I’d say we should merge it and make a release (next week).

ksahlin · 2023-06-16T14:16:43Z

Sounds great. I will also test a version on the standard benchmark that I typically run before a release. Just let me know a commit and I can start it whenever we have something we believe is release-ready.

marcelm · 2023-06-16T14:20:25Z

Do you want to run the test before or after merging?

ksahlin · 2023-06-16T14:23:20Z

I could test after merging.

ksahlin · 2023-06-16T14:26:48Z

I can also compare to a version using XXHASH for s-mers if you want. But I don't see it in the commit network, maybe you didn’t push it yet?

marcelm · 2023-06-16T14:32:47Z

Ok, I’ll merge this now and then push a branch with the XXH changes.

marcelm · 2023-06-16T14:48:45Z

The syncmer XXH change is in branch syncmerhash (commit d8dc256).

ksahlin · 2023-06-17T05:35:18Z

I have benchmarked current develop (commit e0764b6, named v0110) and the XXH change (commit d8dc256, named XXH). See attached plots.

Main points:

1. Memory consumption goes down considerably in v0.11.0 (but we knew that).
2. The new index also seems to be slightly, but consistently, faster for short read lengths (~50-125nt) on the three larger references.
3. XXH branch does not introduce a big computational overhead, it is nearly as fast as 'v0.11.0' (we should consider including it in the release).
4. I don't observe as high benefits using XXH as you did. Maybe because I am using the old binary overlap comparison script, while you are using the Jaccard based evaluation script? I get 0.15% increase in mapping accuracy for CHM, 0.17% for Maize, and 0.28% for rye for 50nt paired-end reads. The mapping-only (no extension) accuracy increases considerably though.
5. Mapping rate also goes up significantly for 50nt reads with XXH compared to other versions.

memory_plot.pdf
time_plot.pdf
accuracy_plot.pdf
percentage_aligned_plot.pdf

ksahlin · 2023-06-17T06:19:05Z

I forgot to mention the indexing speed of the two latest versions (see below for 150nt reads, sorry for the formatting).

In summary, both offer significant speedup over previous versions. XXH is 7-15% slower.

Total time indexing, 150bp reads:

| Reference | v0.11.0 (s) | XXH (s) |
|-----------|-------------|---------|
| Maize     | 92.44       | 106.39  |
| CHM       | 137.94      | 153.05  |
| Rye       | 343.51      | 368.90  |
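
The 7-15% figure can be checked against the indexing times listed above. A short Python sketch of the arithmetic follows; the variable and function names are only for illustration.

```python
# Indexing times in seconds for 150 bp reads, copied from the numbers above:
# (v0.11.0, XXH) per reference.
times = {
    "Maize": (92.44, 106.39),
    "CHM": (137.94, 153.05),
    "Rye": (343.51, 368.90),
}


def slowdown_percent(baseline: float, candidate: float) -> float:
    """Relative increase in run time of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


slowdowns = {ref: slowdown_percent(v011, xxh) for ref, (v011, xxh) in times.items()}

for ref, pct in slowdowns.items():
    print(f"{ref}: XXH indexing is {pct:.1f}% slower")
```

The relative overhead is largest for the smallest of the three references (Maize) and smallest for the largest (Rye).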

marcelm · 2023-06-19T13:45:00Z

> 4. I don't observe as high benefits using XXH as you did. Maybe because I am using the old binary overlap comparison script, while you are using the Jaccard based evaluation script? I get 0.15% increase in mapping accuracy for CHM, 0.17% for Maize, and 0.28% for rye for 50nt paired-end reads. The mapping-only (no extension) accuracy increases considerably though.

I usually don’t report the Jaccard-based numbers at the moment because I want the numbers to be comparable to what your evaluation script computes. I think it was important for the soft-clipping evaluation to use them, but for many other algorithm changes, it doesn’t matter that much.
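
For readers unfamiliar with the two kinds of scoring, here is a minimal Python sketch of how a binary overlap criterion and a Jaccard-based score could differ. This is an assumption for illustration, not the actual evaluation scripts: the function names, the half-open `(start, end)` intervals, and the "any overlap counts as correct" rule are all hypothetical.

```python
Interval = tuple[int, int]  # half-open (start, end) on the same reference sequence


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of reference positions shared by the two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def binary_overlap_correct(true_iv: Interval, mapped_iv: Interval) -> bool:
    """Assumed binary criterion: any overlap with the true location counts as correct."""
    return overlap_length(true_iv, mapped_iv) > 0


def jaccard(true_iv: Interval, mapped_iv: Interval) -> float:
    """Jaccard index of the two intervals: intersection length over union length."""
    inter = overlap_length(true_iv, mapped_iv)
    union = (true_iv[1] - true_iv[0]) + (mapped_iv[1] - mapped_iv[0]) - inter
    return inter / union if union > 0 else 0.0


truth = (1000, 1100)
shifted = (1090, 1190)  # overlaps the true location by only 10 bases

print(binary_overlap_correct(truth, shifted))  # counted as fully correct
print(round(jaccard(truth, shifted), 3))       # receives only a small score
```

Under these assumptions, a read that barely touches its true location scores the same as a perfect placement with the binary criterion, while the Jaccard score penalizes it. That is consistent with the Jaccard-based numbers mattering for the soft-clipping evaluation but less for other changes.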

The difference is probably because I reported numbers for single-end read mapping. I usually test on single reads because it is faster and because I want to see the effect without mate-based rescue. That exaggerates differences a bit.

ksahlin · 2023-06-19T14:07:07Z

I see, makes sense, thanks! Still worth changing to XXH-produced s-mers, in my opinion.

marcelm added 4 commits June 16, 2023 09:56

Remove unused struct member

bec6dfc

Compute frac_unique when needed

71c3e61

Make stats safe to use with large references

2a463f2

Fix indentation

ee6c5fa

marcelm force-pushed the largerefs2 branch from 4956032 to b5d58fb Compare June 16, 2023 09:37

marcelm added 4 commits June 16, 2023 14:12

Log estimated memory usage

a8ff1b4

This is printed before the index is actually created, so one way to use this is to let strobealign run until this point and then hit Ctrl+C if the reported estimate is too high.

Fix return type

d626c1c

Use end() instead of -1

a156c8d

Switch to 64-bit bucket start indices

bdf4012

This doubles memory usage for the randstrobe start vector, but makes strobealign work with large references (those with more than 2**32 strobemers).

marcelm force-pushed the largerefs2 branch from b5d58fb to bdf4012 Compare June 16, 2023 12:12

marcelm added 3 commits June 16, 2023 14:16

Correct log messages

aa49845

Log no. of randstrobes

eaa7a48

Fix some types

f71a67d

marcelm marked this pull request as ready for review June 16, 2023 13:50

Changelog

bdb0f63

marcelm merged commit 08fe904 into main Jun 16, 2023

marcelm deleted the largerefs2 branch June 16, 2023 14:33

marcelm mentioned this pull request Jun 19, 2023

Release 0.11.0 #309

Closed

12 tasks
