Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ALIGNMENT TO REF LONGER THAN 2000bp #17

Closed
TDDB-limagrain opened this issue Feb 23, 2022 · 12 comments
Closed

ALIGNMENT TO REF LONGER THAN 2000bp #17

TDDB-limagrain opened this issue Feb 23, 2022 · 12 comments

Comments

@TDDB-limagrain
Copy link

TDDB-limagrain commented Feb 23, 2022

hi @ksahlin ,
sorry for coming back with a new possible issue!
I used the very latest patch you've made to align a full sample against the public genome we discussed later.
The output was correctly piped to samtools for further duplicate marking and sorting, so thank you very much for allowing the redirection !

Total mapping sites tried: 260811459
Total calls to ssw: 83238353
Calls to ksw (rescue mode): 24136154
Did not fit strobe start site: 452500
Tried rescue: 8059431
Total time mapping: 4317.84 s.
Total time reading read-file(s): 178.746 s.
Total time creating strobemers: 188.722 s.
Total time finding NAMs (non-rescue mode): 618.305 s.
Total time finding NAMs (rescue mode): 341.146 s.
Total time sorting NAMs (candidate sites): 75.2316 s.
Total time reverse compl seq: 0 s.
Total time base level alignment (ssw): 2239.28 s.
Total time writing alignment to files: 525.32 s.

Everything goes well, but looking at the standard output, I obtained the following message several times:
ALIGNMENT TO REF LONGER THAN 2000bp - REPORT TO DEVELOPER.

For example:
ALIGNMENT TO REF LONGER THAN 2000bp - REPORT TO DEVELOPER. Happened for read: TGCCAATCCTGTGGGACGAATATGGGGCTCAAACCTCAGCCAAAACTCAATAGACACAGTGACGAATGTCTGGTAAAAAATTCAGACCAAAATACCAAAGGAGTAAGGCGTAGCAAGTCCCAGACCGAGAGTGAATAAAACCGGTTTTCCG ref len:53438756

Do you have any idea about such a behavior?
I ran strobealign with default parameters.

@ksahlin
Copy link
Owner

ksahlin commented Feb 23, 2022

Hehe, I'm the one who should apologize for user issues. I highly appreciate the feedback!

Ok, at least strobealign ran to completion and the alignment results can be analyzed downstream. I am eager to hear about the SNV and indel analysis.

I will try to fix this bug. It is known, hence the message to stderr. It is nothing crucial (I believe), but obviously still desirable to fix. I may already have such a case in my data. I will look if I identify this in my data so that I can fix it. If not, I may come back to you about obtaining at least one of the read pairs that caused this.

@TDDB-limagrain
Copy link
Author

Happy to help!
I just wanted to be sure that this message would not affect the downstream analysis. So I will analyze other samples and go to the variant analysis benchmark!

@ksahlin
Copy link
Owner

ksahlin commented Apr 1, 2022

Hi @TDDB-limagrain,

I have just released v0.7, which fixes this bug. The new release implements much more efficient multithreading impacting speed. I remember you did a comparison of runtime here. Not sure which version you tried, but at least compared to v0.6-0.6.1, the timing should be improved if you used several cores (the more cores the more the speedup). Happy for any feedback.

@ksahlin ksahlin closed this as completed Apr 1, 2022
@TDDB-limagrain
Copy link
Author

TDDB-limagrain commented Apr 1, 2022 via email

@ksahlin
Copy link
Owner

ksahlin commented Apr 1, 2022

Excellent!

Yes, let me know how it goes and what your runtime and SV-calling experiments say.

@TDDB-limagrain
Copy link
Author

Hi @ksahlin,
v0.7 seems to be indeed faster than the previous versions (avx2 compiled)
cputime previous version: 16262.03
cputime v0.7: 13876.03
with a very similar mapping rate.
Looking forward for the impact on downstream variant calling!

@ksahlin
Copy link
Owner

ksahlin commented Apr 5, 2022

Cool B-) how many cores did you use?

@TDDB-limagrain
Copy link
Author

I used -t 4 !

@ksahlin
Copy link
Owner

ksahlin commented Apr 5, 2022

Oh, yeah, I should have realized that from the command line in your other answer :)

I'm positively surprised that it has this relatively "big" speedup only for 4 cores - its great though. The relative speedup will be even bigger with more cores.

@TDDB-limagrain
Copy link
Author

TDDB-limagrain commented Apr 5, 2022 via email

@ksahlin
Copy link
Owner

ksahlin commented Apr 5, 2022

Thanks for the offer. It's not in priority, but if you have time/interested, sure!

My biggest interest lies in that the variant calls are good. Next up on the schedule is to output supplementary alignments.

@TDDB-limagrain
Copy link
Author

Hi,
I give a try with -t 16, ending up with a cputime of 14260.43.
runtime -t4: 3592s
runtime -t16: 1230s
almost a linear gain!

I just launched some job using synthetic reads simulated with the rtg tools suite. That will be easier for me to share some results rather than using private resequencing runs. Will keep you in touch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants