Add --details option for showing alignment details #318
It is only used to update read-reading runtime, which we can measure outside the function.
plus some renames
I think this PR is excellent, and I agree with what you write about adding whether a mate was rescued and whether it was hamming or not. I also had a look at the code changes and looks good. I was a little bit confused whether the |
I forgot to mention: I actually used
I added the
It was indeed confusing: The paired-end code already had the |
Good point, since this will mainly be used for diagnostics I think we can save
Awesome!
and add al:i: SAM tag for no. of alignments
Ok, I removed the The impact on runtime when |
Okay, great. Happy to merge.
This adds a command-line option named `--details`. When provided, strobealign adds a couple of extra SAM tags to each record that contains an aligned read. At the moment, these two tags are added, but this can grow in the future:

- `na`: Number of NAMs
- `re`: Whether the NAMs were found normally (value: 0) or through rescue (value: 1)

(Lowercase because tags containing lowercase letters are "reserved for end users" according to the SAM spec.)
I've been wanting to have something like this for a while in order to be able to better see how a certain alignment arose.
This is mainly intended for debugging and would hopefully allow us to see more directly why a read failed to align, why it aligned incorrectly, or why anything else unexpected happened. We could also instruct users to re-run strobealign with that option when they report an unexpected alignment, which might allow us to help them or identify the problem without having to re-run the dataset ourselves.
We could also use this (by parsing the tags with an extra script) to get more detailed statistics. For example, we could get the distribution of the number of NAMs or find regions where rescue was necessary particularly often.
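As a rough illustration of such an extra script, here is a minimal Python sketch that tallies the `na` values and counts records with `re` set to 1. It assumes both tags are written as integer tags (`na:i:<n>`, `re:i:<0|1>`), which this description does not spell out. The function names `parse_optional_tags` and `summarize_details` are made up for this example and are not part of strobealign.

```python
from collections import Counter
from typing import Iterable


def parse_optional_tags(sam_line: str) -> dict[str, str]:
    """Return the optional fields of one SAM record as {tag: value}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3:  # skip anything that is not TAG:TYPE:VALUE
            tag, _type, value = parts
            tags[tag] = value
    return tags


def summarize_details(lines: Iterable[str]) -> tuple[Counter, int]:
    """Count records per `na` value and count records whose `re` tag is 1.

    Assumption: `na` and `re` are integer tags as described in this PR.
    """
    nam_counts: Counter = Counter()
    rescued = 0
    for line in lines:
        if line.startswith("@"):  # SAM header line
            continue
        tags = parse_optional_tags(line)
        if "na" in tags:
            nam_counts[int(tags["na"])] += 1
        if tags.get("re") == "1":
            rescued += 1
    return nam_counts, rescued
```

Passing an open SAM file (or `sys.stdin`) to `summarize_details` would give the distribution of NAM counts. Records without the tags, such as unaligned reads, are simply not counted.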
Other tags we could add:
For the moment, some function signatures look a bit more complicated because they have gained yet another parameter. However, for many functions, we can switch from passing `Statistics` and `Details` to only passing `Details` (which would need to gain the same attributes as the statistics). Then the functions don't know about general statistics anymore, but only fill in the particular read-specific details. Then after processing the read, the details are added to the statistics. (This is already done in this PR for the no. of rescued reads.)
This PR also contains some refactoring, some of which was necessary to get this to work, but some of which was only indirectly related.
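The pattern of passing only `Details` and folding it into `Statistics` afterwards could look roughly like the following. This is an illustrative Python sketch, not strobealign's actual C++ code: all field names, the `add` method and the `align_read` stand-in are hypothetical, and triggering rescue only when no NAMs are found is a simplification made for the example.

```python
from dataclasses import dataclass


@dataclass
class Details:
    """Per-read details (hypothetical field names)."""
    nams: int = 0
    nam_rescue: bool = False


@dataclass
class Statistics:
    """Run-wide totals (hypothetical field names)."""
    reads: int = 0
    rescued_reads: int = 0

    def add(self, details: Details) -> None:
        """Fold the details of one processed read into the totals."""
        self.reads += 1
        self.rescued_reads += int(details.nam_rescue)


def align_read(nams_found: int, nams_after_rescue: int) -> Details:
    """Stand-in for an alignment function: it only fills in Details
    and knows nothing about the run-wide Statistics."""
    details = Details(nams=nams_found)
    if nams_found == 0:  # simplified rescue condition (assumption)
        details.nam_rescue = True
        details.nams = nams_after_rescue
    return details


def process_reads(reads: list[tuple[int, int]]) -> Statistics:
    """Process each read, then add its details to the statistics."""
    statistics = Statistics()
    for nams_found, nams_after_rescue in reads:
        details = align_read(nams_found, nams_after_rescue)
        statistics.add(details)
    return statistics
```

With this split, the same `Details` object could both feed the statistics and be used to write the extra SAM tags for the read.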
Marking this as draft because I need to measure impact on performance (I expect it to be essentially zero).