No found reference sequences when headers look the same #86

shafferm · 2021-08-31T17:31:08Z

I am using coverm (version coverm 0.6.1 from bioconda) to get average coverage for some metagenome assemblies to deposit to NCBI. The command I am running is coverm genome -v -m mean -t 15 --bam-files /home/projects-wrighton/NIH_Salmonella/Salmonella/Metagenomes/Megahit/LM_megahit/LM.mapped.sorted.bam -f /home/projects-wrighton/NIH_Salmonella/Salmonella/Metagenomes/Megahit/LM_megahit/final.contigs.2500.fa. When I run this error:

[2021-08-30T21:24:43Z INFO  coverm::genome] Of 18649 reference IDs, 0 were assigned to a genome and 18649 were not
[2021-08-30T21:24:43Z ERROR coverm::genome] Error: There are no found reference sequences that are a part of a genome

The mappings were generated from this reference using bbmap. To dig into this I grabbed the headers from the fasta file using grep (uploaded here: LM_headers.txt) and the headers from the bam file using idxstats (uploaded here: LM.mapped.sorted.idxstats.txt). When I look between them it looks like the headers match. Am I missing something here? Is something in the headers making the matching break?

Thanks,
Mike

The text was updated successfully, but these errors were encountered:

wwood · 2021-08-31T21:11:03Z

Hi Mike,

Strange. Thanks for the report. Are you able to send a cut down BAM file and cut down fasta file so I can test? I suspect it might have something to do with the spaces in the names, but not sure yet. Maybe just email me.
Thanks.

shafferm · 2021-09-01T19:01:46Z

Hey Ben,

Here is a link to the bam file: https://www.dropbox.com/s/tbuwnzv5ywb7jtn/LM.mapped.sorted.subsample001.bam?dl=0
And here is the fasta file: https://www.dropbox.com/s/qjen3gr04oclcuh/LM.subsample001.contigs.fa.gz?dl=0

Happy to move this to email or share full files somewhere else if that is easier. I subsampled to 1% of reads for the bam file and in the fasta file I filtered out contigs with no hits in the subsampled bam. So there are headers in the bam that aren't in the fasta. Hope that isn't a problem.

These are contigs generated from megahit which always has those spaces. I didn't run into this problem with contigs from idba_ud which don't have spaces.

wwood · 2021-09-03T07:55:21Z

Hey Mike, it is the end of the week and I haven't gotten to this. A workaround might be putting the sequence names as input to --genome-definition rather than using -f?

shafferm · 2021-09-07T21:29:50Z

No problem not getting to this yet. I tried using --genome-definition but you can't use -r and --bam-files at the same time and I'd like to be able to use those mappings.

wwood · 2021-09-08T04:27:22Z

~~Hmm, sorry, confused. You can never use -r and --bam-files together at the same time, because coverm either does the mapping or uses a bam as input, not both. I'm not understanding something?~~

wwood · 2021-09-08T20:46:57Z

Hey sorry about the above comment, not sure what I was thinking. Will get back to you.

wwood · 2021-09-09T06:55:40Z

Hi Mike,

I "fixed" this by adding a new flag --use-full-contig-names to the dev branch:

$ cargo run -- genome -b LM.mapped.sorted.subsample001.bam --genome-fasta-files LM.subsample001.contigs.fa --use-full-contig-names
   Compiling coverm v0.6.1 (/home/ben/ownCloud/git/coverm)
    Finished dev [unoptimized + debuginfo] target(s) in 3.80s
     Running `/home/ben/ownCloud/git/coverm/target/debug/coverm genome -b LM.mapped.sorted.subsample001.bam --genome-fasta-files LM.subsample001.contigs.fa --use-full-contig-names`
[2021-09-09T06:52:53Z INFO  coverm] CoverM version 0.6.1
[2021-09-09T06:52:53Z INFO  coverm] Using min-covered-fraction 10%
[2021-09-09T06:52:53Z INFO  coverm::genome] Of 18649 reference IDs, 11050 were assigned to a genome and 7599 were not
[2021-09-09T06:53:02Z INFO  coverm::genome] In sample 'LM.mapped.sorted.subsample001', found 375265 reads mapped out of 441326 total (85.03%)
Genome	LM.mapped.sorted.subsample001 Relative Abundance (%)
unmapped	14.968753
LM.subsample001.contigs	85.03125

Hope that settles it? Thanks for the report. I hope to release a new version soon, but not entirely sure when - let me know if you need a static compile in the meantime.

wwood closed this as completed in 16be37a Sep 9, 2021

ttubb mentioned this issue Feb 9, 2022

Coverage not matching contig names Ecogenomics/CheckM#325

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No found reference sequences when headers look the same #86

No found reference sequences when headers look the same #86

shafferm commented Aug 31, 2021

wwood commented Aug 31, 2021

shafferm commented Sep 1, 2021

wwood commented Sep 3, 2021

shafferm commented Sep 7, 2021

wwood commented Sep 8, 2021 •

edited

Loading

wwood commented Sep 8, 2021

wwood commented Sep 9, 2021

No found reference sequences when headers look the same #86

No found reference sequences when headers look the same #86

Comments

shafferm commented Aug 31, 2021

wwood commented Aug 31, 2021

shafferm commented Sep 1, 2021

wwood commented Sep 3, 2021

shafferm commented Sep 7, 2021

wwood commented Sep 8, 2021 • edited Loading

wwood commented Sep 8, 2021

wwood commented Sep 9, 2021

wwood commented Sep 8, 2021 •

edited

Loading