--interleaved option and SAM flag for paired reads #273

akautt · 2023-04-27T19:03:27Z

I might be missing something, but it seems that strobealign (v0.8.0) does not properly specify SAM flag values for paired reads

When I run it on separate R1 + R2 fastqs, reads in the bam have flags like 77 and 141 (paired unmapped reads; indicating first and second read in pair), but when I run it with '--interleaved' on, otherwise identical, interleaved fastq input, the flags do not contain any information on read pairs anymore (e.g., simply 4 for unmapped reads). Note that strobealign correctly indicates "paired-end mode" in both cases. The lack of read pair information cause issues for downstream applications for me.

Since everything works just fine –great actually! strobealign is amazingly fast; thanks for developing such a great tool– using separate fastqs, I hope it's not too hard to fix this for the --interleaved case. I'd really prefer interleaved mode over separate fastqs since it allows me to efficiently pipe input strobealign

Files and commands for a reproducible example below:

Separate R1 and R2 fastqs as input

strobealign -t 1 reference.fasta R1.fastq R2.fastq | head

> This is strobealign v0.8.0
Estimated read length: 151 bp
Time reading reference: 0.00 s
Reference size: 0.02 Mbp (1 contig; largest: 0.02 Mbp)
Indexing ...
  Time generating seeds: 0.00 s
  Time estimating number of unique hashes: 0.00 s
  Time sorting non-unique seeds: 0.00 s
  Time generating hash table index: 0.00 s
Total time indexing: 0.00 s
Running in @HD  VN:1.6  SO:unsorted
@SQ     SN:MT   LN:16322
@PG     ID:strobealign  PN:strobealign  VN:v0.8.0       CL:strobealign
paired-end mode
A00794:601:HC32JDSX3:1:1101:10004:15248 77      *       0       0       *       *       0       0       CCATGATGGACCGCTTGTGCCACCTACTGGCCAATGGTAGTATCACATGCCGGATATCTGCCTATCCAGGAGAGGTAGCTAGAAAATGCAACCACCGAGTGTGCACACACACTGTGTGGGTACACACACACATGTGCATATACGCACTCAC      FFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFF:FFFFFFF:FFFFFFF:FFFFFFFFFFF,FFFF:FF:FFFFF::FFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFF:FFFFFFFFFFFFF::FF:FFFFFFF
A00794:601:HC32JDSX3:1:1101:10004:15248 141     *       0       0       *       *       0       0       ATATAATGCCAACATGTATGATTCTGTATGTGAGTGCGTATATGCACATGTGTGTGTGTACCCACACAGTGTGTGTGCACACTCGGTGGTTGCATTTTCTAGCTACCTCTCCTGGATAGGCAGATATCCGGCATGTGATACTACCATTGGC      FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFF:FFFFFF,FFFFFFFFFFFFFFFFFFFFF:FF:FFF:FFFF
A00794:601:HC32JDSX3:1:1101:10004:34632 77      *       0       0       *       *       0       0       CACCTTAACTACTCAGATAAGTGACTGTCATTTGTACCTGAGAATGCCAAAGGGTTAAAGATGTCCAGGATCCATGTAAATGAGTTTTTGTCCCCAAAGAAATTTGTACAAATTTTAGGACACACAAATAAATTAGCACAAGTTAGCAAGG      FFFF:,FFFFFFF:FFFFFF::FFFFFFFFFFFFFFFFFFFFFFFF:F,FFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFF:,
A00794:601:HC32JDSX3:1:1101:10004:34632 141     *       0       0       *       *       0       0       GGACAAGGAACCTTGCTATCTGACACACCTCTTGGCAAAGTCTAAAGGAAACCATGACTCTGGTGACTAAGAAGTGACCCTGAGCTCCCCACTCACACCCCTTGCTAACTTGTGCTAATTTATTTGTGTGTCCTAAAATTTGTACAAATTT      FFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,FFFFF:FF,FFF:FFFFFFFFFFF:FFFFF:FFFFFFFFF
A00794:601:HC32JDSX3:1:1101:1000:25989  77      *       0       0       *       *       0       0       NTCCTGCCCCTCTTTGGACGAGAAGATGTGGGGTTTTAGGCATGCTTCAACCCCTTCTTAGTTTTAGGCCAGCCTTGCAGATATTAGGCAAGGATATTTTGTATGAATCAAGGATACAGCATATTCATTACACTTCTGGTCACTGTGACAT      #FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
A00794:601:HC32JDSX3:1:1101:1000:25989  141     *       0       0       *       *       0       0       ATGTCACAGTGACCAGAAGTGTAATGAATATGCTGTATCCTTGATTCATACAAAATATCCTTGCCTAATATCTGCAAGGCTGGCCTAAAACTAAGAAGGGGTTGAAGCATGCCTAAAACCCCACATCTTCTCGTCCAAAGAGGGGCAGGAA      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:

Interleaved mode

strobealign -t 1 reference.fasta interleaved.fastq --interleaved | head

Estimated read length: 151 bp
Time reading reference: 0.00 s
Reference size: 0.02 Mbp (1 contig; largest: 0.02 Mbp)
Indexing ...
  Time generating seeds: 0.00 s
  Time estimating number of unique hashes: 0.00 s
  Time sorting non-unique seeds: 0.00 s
  Time generating hash table index: 0.00 s
Total time indexing: 0.00 s
Running in @HD  VN:1.6  SO:unsorted
@SQ     SN:MT   LN:16322
@PG     ID:strobealign  PN:strobealign  VN:v0.8.0       CL:strobealign
paired-end mode
A00794:601:HC32JDSX3:1:1101:10004:15248 4       *       0       0       *       *       0       0       CCATGATGGACCGCTTGTGCCACCTACTGGCCAATGGTAGTATCACATGCCGGATATCTGCCTATCCAGGAGAGGTAGCTAGAAAATGCAACCACCGAGTGTGCACACACACTGTGTGGGTACACACACACATGTGCATATACGCACTCAC      FFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFF:FFFFFFF:FFFFFFF:FFFFFFFFFFF,FFFF:FF:FFFFF::FFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFF:FFFFFFFFFFFFF::FF:FFFFFFF
A00794:601:HC32JDSX3:1:1101:10004:15248 4       *       0       0       *       *       0       0       ATATAATGCCAACATGTATGATTCTGTATGTGAGTGCGTATATGCACATGTGTGTGTGTACCCACACAGTGTGTGTGCACACTCGGTGGTTGCATTTTCTAGCTACCTCTCCTGGATAGGCAGATATCCGGCATGTGATACTACCATTGGC      FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFF:FFFFFF,FFFFFFFFFFFFFFFFFFFFF:FF:FFF:FFFF
A00794:601:HC32JDSX3:1:1101:10004:34632 4       *       0       0       *       *       0       0       CACCTTAACTACTCAGATAAGTGACTGTCATTTGTACCTGAGAATGCCAAAGGGTTAAAGATGTCCAGGATCCATGTAAATGAGTTTTTGTCCCCAAAGAAATTTGTACAAATTTTAGGACACACAAATAAATTAGCACAAGTTAGCAAGG      FFFF:,FFFFFFF:FFFFFF::FFFFFFFFFFFFFFFFFFFFFFFF:F,FFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFF:,
A00794:601:HC32JDSX3:1:1101:10004:34632 4       *       0       0       *       *       0       0       GGACAAGGAACCTTGCTATCTGACACACCTCTTGGCAAAGTCTAAAGGAAACCATGACTCTGGTGACTAAGAAGTGACCCTGAGCTCCCCACTCACACCCCTTGCTAACTTGTGCTAATTTATTTGTGTGTCCTAAAATTTGTACAAATTT      FFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,FFFFF:FF,FFF:FFFFFFFFFFF:FFFFF:FFFFFFFFF
A00794:601:HC32JDSX3:1:1101:1000:25989  4       *       0       0       *       *       0       0       NTCCTGCCCCTCTTTGGACGAGAAGATGTGGGGTTTTAGGCATGCTTCAACCCCTTCTTAGTTTTAGGCCAGCCTTGCAGATATTAGGCAAGGATATTTTGTATGAATCAAGGATACAGCATATTCATTACACTTCTGGTCACTGTGACAT      #FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
A00794:601:HC32JDSX3:1:1101:1000:25989  4       *       0       0       *       *       0       0       ATGTCACAGTGACCAGAAGTGTAATGAATATGCTGTATCCTTGATTCATACAAAATATCCTTGCCTAATATCTGCAAGGCTGGCCTAAAACTAAGAAGGGGTTGAAGCATGCCTAAAACCCCACATCTTCTCGTCCAAAGAGGGGCAGGAA      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:

strobealign_interleaved_test.tar.gz

The text was updated successfully, but these errors were encountered:

marcelm · 2023-04-27T22:16:56Z

Thanks for reporting! Thanks also for the test case, which allowed me to reproduce this problem.

The issue appears to be the following: strobealign’s --interleaved option actually enables a "mixed mode" (I believe it is equivalent to the bwa mem option -p) that allows input from a single file to contain both single and paired-end reads. This is achieved by looking at the read names: If two consecutive reads in the file have the same name, they are considered to be pairs. If the names are different, they are treated as single reads.

The problem now is that the R1 reads in your test dataset have a /1 suffix, and the R2 reads have a /2 suffix, which makes strobealign treat them as single ends.

This is obfuscated a bit by a different part of the code (run later), where strobealign strips those same suffixes from the read names so that they do not appear in the SAM output.

Maybe we can fix this by stripping the suffixes earlier; I will look into this. Alternatively, we may want --interleaved to actually mean only interleaved and have a separate option for the mixed mode.

akautt · 2023-04-28T00:33:52Z

Thanks for the quick reply!

That makes sense. Either of those two options sound good! Would definitely be great if strobealign could handle fastqs with /1 and /2 suffixes, as that's the output of picard bam2fastq, which I'm using to pipe in input coming from an unmapped bam.

And pointing me to the suffix issue was actually all I needed for now, as I can just pipe the input through a sed one-liner. Just tried that and works like a charm (and everything is still way faster than bwa).

Thanks!

luispedro · 2023-04-28T11:07:06Z

It seems that bwa trims the /[12] suffix (see https://github.com/lh3/bwa/blob/139f68fc4c3747813783a488aef2adc86626b01b/bwaseqio.c#L207) and I'd argue that strobealign should do the same.

I can look into that code

marcelm · 2023-04-28T11:23:35Z

strobealign trims the /1 and /2 suffixes when it outputs the SAM record (strip_suffix in sam.cpp). One problem is that @ksahlin wanted to keep the suffix for PAF output, so doing it instead when the records are read in won’t work.

luispedro · 2023-04-28T11:32:45Z

This can be done when comparing the names for interleaved matching: compared them in a way that ignores the suffix

I'm on my phone now, but I can implement this later

ksahlin · 2023-04-28T11:33:22Z

strobealign trims the /1 and /2 suffixes when it outputs the SAM record (strip_suffix in sam.cpp). One problem is that @ksahlin wanted to keep the suffix for PAF output, so doing it instead when the records are read in won’t work.

I am happy to reconsider this if if would offer a more convenient solution. I don't think paf is commonly used for short reads. For long reads we won't typically have the paired format.

Edit:
I pushed for keeping it the same for PAF because of the benchmark scripts were set up that way to measure accuracy etc. (otherwise there is no way of distinguishing the mates in the PAF, as can be done with the flag in the SAM file). But there might be auxiliary fields in the PAF format where we can add this info instead.

Benchmarking mapping-only is useful for development of the seeding scoring etc.

marcelm · 2023-04-28T12:10:53Z

Removing the suffixes when doing the name comparison during de-interleaving is totally doable, it is just slightly inelegant that this will then be done again when writing the SAM output. But that is the only downside; I cannot imagine this resulting in any type of measurable slowdown.

A read named `read/1` followed by `read/2` should form a pair in the interleaved format. This matches the behaviour of `bwa mem -p` as well closes ksahlin#273

marcelm mentioned this issue Apr 28, 2023

Does --interleaved need to support mixed input? #274

Open

luispedro mentioned this issue Apr 28, 2023

Ignore /1/2 suffix when matching interleaved reads #275

Merged

marcelm closed this as completed in #275 May 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

--interleaved option and SAM flag for paired reads #273

--interleaved option and SAM flag for paired reads #273

akautt commented Apr 27, 2023

marcelm commented Apr 27, 2023 •

edited

Loading

akautt commented Apr 28, 2023

luispedro commented Apr 28, 2023

marcelm commented Apr 28, 2023

luispedro commented Apr 28, 2023

ksahlin commented Apr 28, 2023 •

edited

Loading

marcelm commented Apr 28, 2023 •

edited

Loading

--interleaved option and SAM flag for paired reads #273

--interleaved option and SAM flag for paired reads #273

Comments

akautt commented Apr 27, 2023

marcelm commented Apr 27, 2023 • edited Loading

akautt commented Apr 28, 2023

luispedro commented Apr 28, 2023

marcelm commented Apr 28, 2023

luispedro commented Apr 28, 2023

ksahlin commented Apr 28, 2023 • edited Loading

marcelm commented Apr 28, 2023 • edited Loading

marcelm commented Apr 27, 2023 •

edited

Loading

ksahlin commented Apr 28, 2023 •

edited

Loading

marcelm commented Apr 28, 2023 •

edited

Loading