Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make strobealign work with large references #306

Merged
merged 12 commits into from
Jun 16, 2023
Merged

Make strobealign work with large references #306

merged 12 commits into from
Jun 16, 2023

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Jun 16, 2023

(Marking as draft because this is untested.)

This unconditionally switches the bucket start indices from 32 to 64 bit. This makes strobealign work with more than $2^{32}$ strobemers, but also doubles memory usage of the index vector.

Closes #277
Closes #285

@ksahlin
Copy link
Owner

ksahlin commented Jun 16, 2023

As i just commented in #305 , I think it is a great idea until templates are implemented.

marcelm added 4 commits June 16, 2023 14:12
This is printed before the index is actually created, so one way to use this
is to let strobealign run until this point and then hit Ctrl+C if the
reported estimate is too high.
This doubles memory usage for the randstrobe start vector, but makes
strobealign work with large references (those with more than 2**32
strobemers).
@marcelm marcelm marked this pull request as ready for review June 16, 2023 13:50
@marcelm
Copy link
Collaborator Author

marcelm commented Jun 16, 2023

During testing, I found a couple of other types that needed fixing, but it works now!

I used the 100 bp single-end rye dataset for testing.

With the default settings, the no. of strobemers is below 2 billion and we get a mapping rate of 99.2582% and accuracy of 69.2576%. To test this PR, I used the same dataset, but with options -k 22 -s 22, which results in over 7 billion strobemers and needs 140 GB of RAM. The mapping rate was 99.2639% and accuracy 70.1665%.

I think this is a good indication that there’s nothing fundementally wrong with the PR, so I’d say we should merge it and make a release (next week).

@ksahlin
Copy link
Owner

ksahlin commented Jun 16, 2023

Sounds great. I will also test a version on the standard benchmark that I typically do before release, just let me know a commit and can start it whenever we have something we believe is release-ready.

@marcelm
Copy link
Collaborator Author

marcelm commented Jun 16, 2023

Do you want to run the test before or after merging?

@ksahlin
Copy link
Owner

ksahlin commented Jun 16, 2023

I could test after merging.

@ksahlin
Copy link
Owner

ksahlin commented Jun 16, 2023

I can also compare to a version using XXHASH for s-mers if you want. But I don't see it in the commit network, maybe you didn’t push it yet?

@marcelm
Copy link
Collaborator Author

marcelm commented Jun 16, 2023

Ok, I’ll merge this now and then push a branch with the XXH changes.

@marcelm marcelm merged commit 08fe904 into main Jun 16, 2023
@marcelm marcelm deleted the largerefs2 branch June 16, 2023 14:33
@marcelm
Copy link
Collaborator Author

marcelm commented Jun 16, 2023

The syncmer XXH change is in branch syncmerhash (commit d8dc256).

@ksahlin
Copy link
Owner

ksahlin commented Jun 17, 2023

I have benchmarked current develop (commit e0764b6, Name v0110), and the XXH change (commit d8dc256 , name XXH). See attached plots.

Main points:

  1. Memory consumption goes down considerably in v0.11.0 (but we knew that)
  2. The new index seems to be also slightly, but consistently, faster for short read lengths (~50-125nt) on the three larger references.
  3. XXH branch does not introduce a big computational overhead, it is nearly as fast as 'v0.11.0' (we should consider including it in the release)
  4. I don't observe as high benefits using XXH as you did. Maybe because I am using the old binary overlap comparison script, while you are using the Jaccard based evaluation script? I get 0.15% increase in mapping accuracy for CHM , 0.17% for Maize, and 0.28% for rye for 50nt paired-end reads. The mapping-only (no extension) accuracy increases considerably though.
  5. Mapping rate also goes up significantly for 50nt reads with XXH compared to other versions.

memory_plot.pdf
time_plot.pdf
accuracy_plot.pdf
percentage_aligned_plot.pdf

@ksahlin
Copy link
Owner

ksahlin commented Jun 17, 2023

I forgot to mention the indexing speed of the two latest versions (see below for 150nt reads, sorry for the formatting).

In summary, both offer significant speedup over previous versions. XXH is 7-15% slower.

150bp v0.11.0
Maize  Total time indexing: 92.44 s
CHM Total time indexing: 137.94 s
Rye Total time indexing: 343.51 s

150bp XXH 
Maize Total time indexing: 106.39 s
CHM Total time indexing: 153.05 s
Rye Total time indexing: 368.90 s

@marcelm marcelm mentioned this pull request Jun 19, 2023
12 tasks
@marcelm
Copy link
Collaborator Author

marcelm commented Jun 19, 2023

4. I don't observe as high benefits using XXH as you did. Maybe because I am using the old binary overlap comparison script, while you are using the Jaccard based evaluation script? I get 0.15% increase in mapping accuracy for CHM , 0.17% for Maize, and 0.28% for rye for 50nt paired-end reads. The mapping-only (no extension) accuracy increases considerably though.

I usually don’t report the Jaccard-based numbers at the moment because I want the numbers to be comparable to what your evaluation script computes. I think it was important for the soft-clipping evaluation to use them, but for many other algorithm changes, it doesn’t matter that much.

The difference is probably because I reported numbers for single-end read mapping. I usually test on single reads because it is faster and because I want to see the effect without mate-based rescue. That exaggerates differences a bit.

@ksahlin
Copy link
Owner

ksahlin commented Jun 19, 2023

I see, makes sense, thanks! Still worth to change to XXH produced s-mers in my opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

map overflow - large reference db outputs empty sam files
2 participants