Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No found reference sequences when headers look the same #86

Closed
shafferm opened this issue Aug 31, 2021 · 7 comments
Closed

No found reference sequences when headers look the same #86

shafferm opened this issue Aug 31, 2021 · 7 comments

Comments

@shafferm
Copy link

Hello @wwood,

I am using coverm (version coverm 0.6.1 from bioconda) to get average coverage for some metagenome assemblies to deposit to NCBI. The command I am running is coverm genome -v -m mean -t 15 --bam-files /home/projects-wrighton/NIH_Salmonella/Salmonella/Metagenomes/Megahit/LM_megahit/LM.mapped.sorted.bam -f /home/projects-wrighton/NIH_Salmonella/Salmonella/Metagenomes/Megahit/LM_megahit/final.contigs.2500.fa. When I run this error:

[2021-08-30T21:24:43Z INFO  coverm::genome] Of 18649 reference IDs, 0 were assigned to a genome and 18649 were not
[2021-08-30T21:24:43Z ERROR coverm::genome] Error: There are no found reference sequences that are a part of a genome

The mappings were generated from this reference using bbmap. To dig into this I grabbed the headers from the fasta file using grep (uploaded here: LM_headers.txt) and the headers from the bam file using idxstats (uploaded here: LM.mapped.sorted.idxstats.txt). When I look between them it looks like the headers match. Am I missing something here? Is something in the headers making the matching break?

Thanks,
Mike

@wwood
Copy link
Owner

wwood commented Aug 31, 2021

Hi Mike,

Strange. Thanks for the report. Are you able to send a cut down BAM file and cut down fasta file so I can test? I suspect it might have something to do with the spaces in the names, but not sure yet. Maybe just email me.
Thanks.

@shafferm
Copy link
Author

shafferm commented Sep 1, 2021

Hey Ben,

Here is a link to the bam file: https://www.dropbox.com/s/tbuwnzv5ywb7jtn/LM.mapped.sorted.subsample001.bam?dl=0
And here is the fasta file: https://www.dropbox.com/s/qjen3gr04oclcuh/LM.subsample001.contigs.fa.gz?dl=0

Happy to move this to email or share full files somewhere else if that is easier. I subsampled to 1% of reads for the bam file and in the fasta file I filtered out contigs with no hits in the subsampled bam. So there are headers in the bam that aren't in the fasta. Hope that isn't a problem.

These are contigs generated from megahit which always has those spaces. I didn't run into this problem with contigs from idba_ud which don't have spaces.

@wwood
Copy link
Owner

wwood commented Sep 3, 2021

Hey Mike, it is the end of the week and I haven't gotten to this. A workaround might be putting the sequence names as input to --genome-definition rather than using -f?

@shafferm
Copy link
Author

shafferm commented Sep 7, 2021

No problem not getting to this yet. I tried using --genome-definition but you can't use -r and --bam-files at the same time and I'd like to be able to use those mappings.

@wwood
Copy link
Owner

wwood commented Sep 8, 2021

Hmm, sorry, confused. You can never use -r and --bam-files together at the same time, because coverm either does the mapping or uses a bam as input, not both. I'm not understanding something?

@wwood
Copy link
Owner

wwood commented Sep 8, 2021

Hey sorry about the above comment, not sure what I was thinking. Will get back to you.

@wwood wwood closed this as completed in 16be37a Sep 9, 2021
@wwood
Copy link
Owner

wwood commented Sep 9, 2021

Hi Mike,

I "fixed" this by adding a new flag --use-full-contig-names to the dev branch:

$ cargo run -- genome -b LM.mapped.sorted.subsample001.bam --genome-fasta-files LM.subsample001.contigs.fa --use-full-contig-names
   Compiling coverm v0.6.1 (/home/ben/ownCloud/git/coverm)
    Finished dev [unoptimized + debuginfo] target(s) in 3.80s
     Running `/home/ben/ownCloud/git/coverm/target/debug/coverm genome -b LM.mapped.sorted.subsample001.bam --genome-fasta-files LM.subsample001.contigs.fa --use-full-contig-names`
[2021-09-09T06:52:53Z INFO  coverm] CoverM version 0.6.1
[2021-09-09T06:52:53Z INFO  coverm] Using min-covered-fraction 10%
[2021-09-09T06:52:53Z INFO  coverm::genome] Of 18649 reference IDs, 11050 were assigned to a genome and 7599 were not
[2021-09-09T06:53:02Z INFO  coverm::genome] In sample 'LM.mapped.sorted.subsample001', found 375265 reads mapped out of 441326 total (85.03%)
Genome	LM.mapped.sorted.subsample001 Relative Abundance (%)
unmapped	14.968753
LM.subsample001.contigs	85.03125

Hope that settles it? Thanks for the report. I hope to release a new version soon, but not entirely sure when - let me know if you need a static compile in the meantime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants