Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--interleaved option and SAM flag for paired reads #273

Closed
akautt opened this issue Apr 27, 2023 · 7 comments · Fixed by #275
Closed

--interleaved option and SAM flag for paired reads #273

akautt opened this issue Apr 27, 2023 · 7 comments · Fixed by #275

Comments

@akautt
Copy link

akautt commented Apr 27, 2023

I might be missing something, but it seems that strobealign (v0.8.0) does not properly specify SAM flag values for paired reads

When I run it on separate R1 + R2 fastqs, reads in the bam have flags like 77 and 141 (paired unmapped reads; indicating first and second read in pair), but when I run it with '--interleaved' on, otherwise identical, interleaved fastq input, the flags do not contain any information on read pairs anymore (e.g., simply 4 for unmapped reads). Note that strobealign correctly indicates "paired-end mode" in both cases. The lack of read pair information cause issues for downstream applications for me.

Since everything works just fine –great actually! strobealign is amazingly fast; thanks for developing such a great tool– using separate fastqs, I hope it's not too hard to fix this for the --interleaved case. I'd really prefer interleaved mode over separate fastqs since it allows me to efficiently pipe input strobealign

Files and commands for a reproducible example below:

Separate R1 and R2 fastqs as input

strobealign -t 1 reference.fasta R1.fastq R2.fastq | head

> This is strobealign v0.8.0
Estimated read length: 151 bp
Time reading reference: 0.00 s
Reference size: 0.02 Mbp (1 contig; largest: 0.02 Mbp)
Indexing ...
  Time generating seeds: 0.00 s
  Time estimating number of unique hashes: 0.00 s
  Time sorting non-unique seeds: 0.00 s
  Time generating hash table index: 0.00 s
Total time indexing: 0.00 s
Running in @HD  VN:1.6  SO:unsorted
@SQ     SN:MT   LN:16322
@PG     ID:strobealign  PN:strobealign  VN:v0.8.0       CL:strobealign
paired-end mode
A00794:601:HC32JDSX3:1:1101:10004:15248 77      *       0       0       *       *       0       0       CCATGATGGACCGCTTGTGCCACCTACTGGCCAATGGTAGTATCACATGCCGGATATCTGCCTATCCAGGAGAGGTAGCTAGAAAATGCAACCACCGAGTGTGCACACACACTGTGTGGGTACACACACACATGTGCATATACGCACTCAC      FFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFF:FFFFFFF:FFFFFFF:FFFFFFFFFFF,FFFF:FF:FFFFF::FFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFF:FFFFFFFFFFFFF::FF:FFFFFFF
A00794:601:HC32JDSX3:1:1101:10004:15248 141     *       0       0       *       *       0       0       ATATAATGCCAACATGTATGATTCTGTATGTGAGTGCGTATATGCACATGTGTGTGTGTACCCACACAGTGTGTGTGCACACTCGGTGGTTGCATTTTCTAGCTACCTCTCCTGGATAGGCAGATATCCGGCATGTGATACTACCATTGGC      FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFF:FFFFFF,FFFFFFFFFFFFFFFFFFFFF:FF:FFF:FFFF
A00794:601:HC32JDSX3:1:1101:10004:34632 77      *       0       0       *       *       0       0       CACCTTAACTACTCAGATAAGTGACTGTCATTTGTACCTGAGAATGCCAAAGGGTTAAAGATGTCCAGGATCCATGTAAATGAGTTTTTGTCCCCAAAGAAATTTGTACAAATTTTAGGACACACAAATAAATTAGCACAAGTTAGCAAGG      FFFF:,FFFFFFF:FFFFFF::FFFFFFFFFFFFFFFFFFFFFFFF:F,FFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFF:,
A00794:601:HC32JDSX3:1:1101:10004:34632 141     *       0       0       *       *       0       0       GGACAAGGAACCTTGCTATCTGACACACCTCTTGGCAAAGTCTAAAGGAAACCATGACTCTGGTGACTAAGAAGTGACCCTGAGCTCCCCACTCACACCCCTTGCTAACTTGTGCTAATTTATTTGTGTGTCCTAAAATTTGTACAAATTT      FFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,FFFFF:FF,FFF:FFFFFFFFFFF:FFFFF:FFFFFFFFF
A00794:601:HC32JDSX3:1:1101:1000:25989  77      *       0       0       *       *       0       0       NTCCTGCCCCTCTTTGGACGAGAAGATGTGGGGTTTTAGGCATGCTTCAACCCCTTCTTAGTTTTAGGCCAGCCTTGCAGATATTAGGCAAGGATATTTTGTATGAATCAAGGATACAGCATATTCATTACACTTCTGGTCACTGTGACAT      #FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
A00794:601:HC32JDSX3:1:1101:1000:25989  141     *       0       0       *       *       0       0       ATGTCACAGTGACCAGAAGTGTAATGAATATGCTGTATCCTTGATTCATACAAAATATCCTTGCCTAATATCTGCAAGGCTGGCCTAAAACTAAGAAGGGGTTGAAGCATGCCTAAAACCCCACATCTTCTCGTCCAAAGAGGGGCAGGAA      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:

Interleaved mode

strobealign -t 1 reference.fasta interleaved.fastq --interleaved | head

Estimated read length: 151 bp
Time reading reference: 0.00 s
Reference size: 0.02 Mbp (1 contig; largest: 0.02 Mbp)
Indexing ...
  Time generating seeds: 0.00 s
  Time estimating number of unique hashes: 0.00 s
  Time sorting non-unique seeds: 0.00 s
  Time generating hash table index: 0.00 s
Total time indexing: 0.00 s
Running in @HD  VN:1.6  SO:unsorted
@SQ     SN:MT   LN:16322
@PG     ID:strobealign  PN:strobealign  VN:v0.8.0       CL:strobealign
paired-end mode
A00794:601:HC32JDSX3:1:1101:10004:15248 4       *       0       0       *       *       0       0       CCATGATGGACCGCTTGTGCCACCTACTGGCCAATGGTAGTATCACATGCCGGATATCTGCCTATCCAGGAGAGGTAGCTAGAAAATGCAACCACCGAGTGTGCACACACACTGTGTGGGTACACACACACATGTGCATATACGCACTCAC      FFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFF:FFFFFFF:FFFFFFF:FFFFFFFFFFF,FFFF:FF:FFFFF::FFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFF:FFFFFFFFFFFFF::FF:FFFFFFF
A00794:601:HC32JDSX3:1:1101:10004:15248 4       *       0       0       *       *       0       0       ATATAATGCCAACATGTATGATTCTGTATGTGAGTGCGTATATGCACATGTGTGTGTGTACCCACACAGTGTGTGTGCACACTCGGTGGTTGCATTTTCTAGCTACCTCTCCTGGATAGGCAGATATCCGGCATGTGATACTACCATTGGC      FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFF:FFFFFF,FFFFFFFFFFFFFFFFFFFFF:FF:FFF:FFFF
A00794:601:HC32JDSX3:1:1101:10004:34632 4       *       0       0       *       *       0       0       CACCTTAACTACTCAGATAAGTGACTGTCATTTGTACCTGAGAATGCCAAAGGGTTAAAGATGTCCAGGATCCATGTAAATGAGTTTTTGTCCCCAAAGAAATTTGTACAAATTTTAGGACACACAAATAAATTAGCACAAGTTAGCAAGG      FFFF:,FFFFFFF:FFFFFF::FFFFFFFFFFFFFFFFFFFFFFFF:F,FFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFF:,
A00794:601:HC32JDSX3:1:1101:10004:34632 4       *       0       0       *       *       0       0       GGACAAGGAACCTTGCTATCTGACACACCTCTTGGCAAAGTCTAAAGGAAACCATGACTCTGGTGACTAAGAAGTGACCCTGAGCTCCCCACTCACACCCCTTGCTAACTTGTGCTAATTTATTTGTGTGTCCTAAAATTTGTACAAATTT      FFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,FFFFF:FF,FFF:FFFFFFFFFFF:FFFFF:FFFFFFFFF
A00794:601:HC32JDSX3:1:1101:1000:25989  4       *       0       0       *       *       0       0       NTCCTGCCCCTCTTTGGACGAGAAGATGTGGGGTTTTAGGCATGCTTCAACCCCTTCTTAGTTTTAGGCCAGCCTTGCAGATATTAGGCAAGGATATTTTGTATGAATCAAGGATACAGCATATTCATTACACTTCTGGTCACTGTGACAT      #FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
A00794:601:HC32JDSX3:1:1101:1000:25989  4       *       0       0       *       *       0       0       ATGTCACAGTGACCAGAAGTGTAATGAATATGCTGTATCCTTGATTCATACAAAATATCCTTGCCTAATATCTGCAAGGCTGGCCTAAAACTAAGAAGGGGTTGAAGCATGCCTAAAACCCCACATCTTCTCGTCCAAAGAGGGGCAGGAA      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:

strobealign_interleaved_test.tar.gz

@marcelm
Copy link
Collaborator

marcelm commented Apr 27, 2023

Thanks for reporting! Thanks also for the test case, which allowed me to reproduce this problem.

The issue appears to be the following: strobealign’s --interleaved option actually enables a "mixed mode" (I believe it is equivalent to the bwa mem option -p) that allows input from a single file to contain both single and paired-end reads. This is achieved by looking at the read names: If two consecutive reads in the file have the same name, they are considered to be pairs. If the names are different, they are treated as single reads.

The problem now is that the R1 reads in your test dataset have a /1 suffix, and the R2 reads have a /2 suffix, which makes strobealign treat them as single ends.

This is obfuscated a bit by a different part of the code (run later), where strobealign strips those same suffixes from the read names so that they do not appear in the SAM output.

Maybe we can fix this by stripping the suffixes earlier; I will look into this. Alternatively, we may want --interleaved to actually mean only interleaved and have a separate option for the mixed mode.

@akautt
Copy link
Author

akautt commented Apr 28, 2023

Thanks for the quick reply!

That makes sense. Either of those two options sound good! Would definitely be great if strobealign could handle fastqs with /1 and /2 suffixes, as that's the output of picard bam2fastq, which I'm using to pipe in input coming from an unmapped bam.

And pointing me to the suffix issue was actually all I needed for now, as I can just pipe the input through a sed one-liner. Just tried that and works like a charm (and everything is still way faster than bwa).

Thanks!

@luispedro
Copy link
Contributor

It seems that bwa trims the /[12] suffix (see https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/bwaseqio.c#L207) and I'd argue that strobealign should do the same.

I can look into that code

@marcelm
Copy link
Collaborator

marcelm commented Apr 28, 2023

strobealign trims the /1 and /2 suffixes when it outputs the SAM record (strip_suffix in sam.cpp). One problem is that @ksahlin wanted to keep the suffix for PAF output, so doing it instead when the records are read in won’t work.

@luispedro
Copy link
Contributor

This can be done when comparing the names for interleaved matching: compared them in a way that ignores the suffix

I'm on my phone now, but I can implement this later

@ksahlin
Copy link
Owner

ksahlin commented Apr 28, 2023

strobealign trims the /1 and /2 suffixes when it outputs the SAM record (strip_suffix in sam.cpp). One problem is that @ksahlin wanted to keep the suffix for PAF output, so doing it instead when the records are read in won’t work.

I am happy to reconsider this if if would offer a more convenient solution. I don't think paf is commonly used for short reads. For long reads we won't typically have the paired format.

Edit:
I pushed for keeping it the same for PAF because of the benchmark scripts were set up that way to measure accuracy etc. (otherwise there is no way of distinguishing the mates in the PAF, as can be done with the flag in the SAM file). But there might be auxiliary fields in the PAF format where we can add this info instead.

Benchmarking mapping-only is useful for development of the seeding scoring etc.

@marcelm
Copy link
Collaborator

marcelm commented Apr 28, 2023

Removing the suffixes when doing the name comparison during de-interleaving is totally doable, it is just slightly inelegant that this will then be done again when writing the SAM output. But that is the only downside; I cannot imagine this resulting in any type of measurable slowdown.

luispedro added a commit to luispedro/strobealign that referenced this issue Apr 28, 2023
A read named `read/1` followed by `read/2` should form a pair in the
interleaved format. This matches the behaviour of `bwa mem -p` as well

closes ksahlin#273
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants