Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --details option for showing alignment details #318

Merged
merged 16 commits into from
Jun 27, 2023
Merged

Add --details option for showing alignment details #318

merged 16 commits into from
Jun 27, 2023

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Jun 26, 2023

This adds a command-line option named --details. When provided, strobealign adds a couple of extra SAM tags to each record that contains an aligned read. At the moment, these two tags are added, but this can grow in the future:

  • na: Number of NAMs
  • re: Whether the NAMs where found normally (value: 0) or through rescue (value: 1)
    (Lowercase because tags containing lowercase letters are "reserved for end users" according to the SAM spec.)

I've been wanting to have something like this for a while in order to be able to better see how a certain alignment arose.

This is mainly intended for debugging and would hopefully allow us to more directly see why a read failed to align, why it aligned incorrectly or why anything else unexpected happened. We could also instruct users to re-run strobealign with that option when they report an enexpected alignment, which would maybe allow us to help them or identify the problem without having to re-run the dataset ourselves.

We could also use this (by parsing the tags with an extra script) to get more detailed statistics. For example, we could get the distribution of the number of NAMs or find regions where rescue was necessary particularly often.

Other tags we could add:

  • Whether a mate was rescued
  • Whether ungapped (Hamming) or gapped (SSW) alignment was done

For the moment, some function signatures look a bit more complicated because they have gained yet another parameter. However, for many functions, we can switch from passing Statistics and Details to only passing Details (which would need to gain the same attributes as the statistics). Then the functions don't know about general statistics anymore, but only fill in the particular read-specific details. Then after processing the read, the details are added to the statistics. (This is already done in this PR for the no. of rescued reads.)

This PR also contains some refactoring, some of which was necessary to get this to work, but some of which was only indirectly related.

Marking this as draft because I need to measure impact on performance (I expect it to be essentially zero).

@ksahlin
Copy link
Owner

ksahlin commented Jun 26, 2023

I think this PR is excellent, and I agree with what you write about adding whether a mate was rescued and whether it was hamming or not.

I also had a look at the code changes and looks good. I was a little bit confused whether the is_suppl -> is_primary change is also propagated to the PE reads (I saw only changes in the SE case), but I assume you have checked and also the phiX dataset does not show any changes.

@marcelm
Copy link
Collaborator Author

marcelm commented Jun 26, 2023

I forgot to mention: I actually used -d for this option and --details is just an alias. Let me know if you prefer not to use the (valuable) single-character option. BWA-MEM uses -d, where it means "off-diagonal X-dropoff".

I think this PR is excellent, and I agree with what you write about adding whether a mate was rescued and whether it was hamming or not.

I added the mr tag, which indicates that "mate rescue" was done. I've adjusted the terminology also in the logging output to distinguish "NAM rescue" and "mate rescue". I'll look into hamming/SSW as well.

I also had a look at the code changes and looks good. I was a little bit confused whether the is_suppl -> is_primary change is also propagated to the PE reads (I saw only changes in the SE case), but I assume you have checked and also the phiX dataset does not show any changes.

It was indeed confusing: The paired-end code already had the is_primary logic; commit 0755994 just makes it consistent for both the single- and paired-end functions.

@ksahlin
Copy link
Owner

ksahlin commented Jun 27, 2023

I forgot to mention: I actually used -d for this option and --details is just an alias. Let me know if you prefer not to use the (valuable) single-character option. BWA-MEM uses -d, where it means "off-diagonal X-dropoff".

Good point, since this will mainly be used for diagnostics I think we can save -d and use only --details.

I added the mr tag, which indicates that "mate rescue" was done. I've adjusted the terminology also in the logging output to distinguish "NAM rescue" and "mate rescue". I'll look into hamming/SSW as well.

Awesome!

@marcelm marcelm marked this pull request as ready for review June 27, 2023 09:33
@marcelm
Copy link
Collaborator Author

marcelm commented Jun 27, 2023

Ok, I removed the -d alias, added a ga tag that shows the no. of gapped alignments, and managed to get rid of the Statistics parameter for some functions (they only get Details now).

The impact on runtime when --details is not used is nonexistent, as expected.

@ksahlin
Copy link
Owner

ksahlin commented Jun 27, 2023

okay, great. Happy to merge.

@marcelm marcelm merged commit 8f397fb into main Jun 27, 2023
@marcelm marcelm deleted the details branch June 27, 2023 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants