Add `--details` option for showing alignment details #318

marcelm · 2023-06-26T14:13:39Z

This adds a command-line option named `--details`. When provided, strobealign adds a couple of extra SAM tags to each record that contains an aligned read. At the moment, the following two tags are added, but the list can grow in the future:

- `na`: Number of NAMs
- `re`: Whether the NAMs were found normally (value: 0) or through rescue (value: 1)

(Lowercase because tags containing lowercase letters are "reserved for end users" according to the SAM spec.)

I've been wanting to have something like this for a while in order to be able to better see how a certain alignment arose.

This is mainly intended for debugging and would hopefully allow us to see more directly why a read failed to align, why it aligned incorrectly, or why anything else unexpected happened. We could also instruct users to re-run strobealign with that option when they report an unexpected alignment, which might allow us to help them or identify the problem without having to re-run the dataset ourselves.

We could also use this (by parsing the tags with an extra script) to get more detailed statistics. For example, we could get the distribution of the number of NAMs or find regions where rescue was necessary particularly often.
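Such a script could look roughly like the following. This is a minimal sketch, not part of the PR: the function names `get_int_tag` and `tag_distribution` are made up for illustration, and it assumes the detail tags are written as integer-typed optional fields (for example `na:i:<value>`).

```python
from collections import Counter
from typing import Iterable, Optional


def get_int_tag(sam_line: str, tag: str) -> Optional[int]:
    """Return the value of an integer-typed optional field (TAG:i:VALUE), or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns of a SAM record are mandatory; optional fields follow.
    for field in fields[11:]:
        name, _, rest = field.partition(":")
        type_code, _, value = rest.partition(":")
        if name == tag and type_code == "i":
            return int(value)
    return None


def tag_distribution(sam_lines: Iterable[str], tag: str = "na") -> Counter:
    """Count how often each value of `tag` occurs across alignment records."""
    counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        value = get_int_tag(line, tag)
        if value is not None:  # records without the tag are ignored
            counts[value] += 1
    return counts
```

Passing an open SAM file (text mode) as `sam_lines` with `tag="na"` would give the distribution of the number of NAMs; `tag="re"` would count how many aligned reads needed rescue.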

Other tags we could add:

- Whether a mate was rescued
- Whether ungapped (Hamming) or gapped (SSW) alignment was done

For the moment, some function signatures look a bit more complicated because they have gained yet another parameter. However, for many functions, we can switch from passing Statistics and Details to only passing Details (which would need to gain the same attributes as the statistics). Then the functions don't know about general statistics anymore, but only fill in the particular read-specific details. Then after processing the read, the details are added to the statistics. (This is already done in this PR for the no. of rescued reads.)
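The pattern described here can be sketched as follows. This is only an illustration in Python, not the project's code: the class and field names (`Details.nams`, `Details.nam_rescue`, `Statistics.rescued_reads`) and the placeholder `align_read` logic are assumptions chosen to show the data flow.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Details:
    """Read-specific details, filled in while one read is processed."""
    nams: int = 0
    nam_rescue: bool = False


@dataclass
class Statistics:
    """Run-wide totals, updated once per read from its Details."""
    reads: int = 0
    rescued_reads: int = 0

    def add(self, details: Details) -> None:
        self.reads += 1
        self.rescued_reads += int(details.nam_rescue)


def align_read(read: str) -> Details:
    """Stand-in for the aligner; it knows nothing about Statistics."""
    details = Details()
    details.nams = read.count("A")  # placeholder, not real NAM finding
    details.nam_rescue = details.nams == 0  # pretend rescue is needed without NAMs
    return details


def process(reads: Iterable[str]) -> Statistics:
    stats = Statistics()
    for read in reads:
        details = align_read(read)  # per-read work only touches Details
        stats.add(details)  # aggregation happens after the read is done
    return stats
```

The per-read function takes no statistics parameter at all, which is the signature simplification the paragraph above is aiming for.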

This PR also contains some refactoring, some of which was necessary to get this to work, but some of which was only indirectly related.

Marking this as draft because I need to measure impact on performance (I expect it to be essentially zero).

ksahlin · 2023-06-26T14:40:57Z

I think this PR is excellent, and I agree with what you write about adding whether a mate was rescued and whether it was Hamming or not.

I also had a look at the code changes, and they look good. I was a little bit confused about whether the is_suppl -> is_primary change is also propagated to the PE reads (I saw only changes in the SE case), but I assume you have checked, and that the phiX dataset also does not show any changes.

marcelm · 2023-06-26T16:59:20Z

I forgot to mention: I actually used -d for this option and --details is just an alias. Let me know if you prefer not to use the (valuable) single-character option. BWA-MEM uses -d, where it means "off-diagonal X-dropoff".

> I think this PR is excellent, and I agree with what you write about adding whether a mate was rescued and whether it was Hamming or not.

I added the `mr` tag, which indicates that "mate rescue" was done. I've also adjusted the terminology in the logging output to distinguish "NAM rescue" and "mate rescue". I'll look into Hamming/SSW as well.

> I also had a look at the code changes, and they look good. I was a little bit confused about whether the is_suppl -> is_primary change is also propagated to the PE reads (I saw only changes in the SE case), but I assume you have checked, and that the phiX dataset also does not show any changes.

It was indeed confusing: The paired-end code already had the is_primary logic; commit 0755994 just makes it consistent for both the single- and paired-end functions.

ksahlin · 2023-06-27T03:53:43Z

> I forgot to mention: I actually used -d for this option and --details is just an alias. Let me know if you prefer not to use the (valuable) single-character option. BWA-MEM uses -d, where it means "off-diagonal X-dropoff".

Good point. Since this will mainly be used for diagnostics, I think we can save -d and use only --details.

> I added the `mr` tag, which indicates that "mate rescue" was done. I've also adjusted the terminology in the logging output to distinguish "NAM rescue" and "mate rescue". I'll look into Hamming/SSW as well.

Awesome!

marcelm · 2023-06-27T09:35:04Z

OK, I removed the -d alias, added a `ga` tag that shows the no. of gapped alignments, and managed to get rid of the Statistics parameter for some functions (they only get Details now).

The impact on runtime when --details is not used is nonexistent, as expected.

ksahlin · 2023-06-27T12:13:03Z

okay, great. Happy to merge.

marcelm added 6 commits June 26, 2023 10:32

Use CigarOps enum

109bd36

Do not pass statistics to read_records

cd5a49d

It is only used to update read-reading runtime, which we can measure outside the function.

Let get_best_scoring_pair return a vector

18a8d5f

plus some renames

Introduce a Details struct for alignment-level "statistics"

bba0fd5

Let rescue_mate() return whether rescue was successful

90cb9d2

is_secondary → is_primary

0755994

marcelm force-pushed the details branch from f2cce55 to 8d2ec08 Compare June 26, 2023 20:49

Option --details adds alignment details to SAM output

1799a07

marcelm force-pushed the details branch from 8d2ec08 to 811c6c4 Compare June 27, 2023 08:39

marcelm added 7 commits June 27, 2023 11:12

Test --details output

eac4753

did_not_fit -> inconsistent_nam

140225f

Print ref_id in Nam() representation

6670ec2

With --details, show whether alignment arose from mate-based rescue

963f872

Rename tried_rescue to nam_rescue

ed87e95

Track tot_rescued and tot_all_tried through Details, not Statistics

01324bd

and add al:i: SAM tag for no. of alignments

Track inconsistent NAMs in Detail

6616e07

marcelm force-pushed the details branch from 811c6c4 to 05af718 Compare June 27, 2023 09:12

Show no. of gapped alignments in details

04334ed

marcelm force-pushed the details branch from 05af718 to 04334ed Compare June 27, 2023 09:26

Install samtools for CI tests

6ee6a4d

marcelm marked this pull request as ready for review June 27, 2023 09:33

marcelm merged commit 8f397fb into main Jun 27, 2023

marcelm deleted the details branch June 27, 2023 14:09
