Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about the two detection methods #65

Open
CHENAO-QIAN opened this issue Jul 29, 2024 · 7 comments
Open

Question about the two detection methods #65

CHENAO-QIAN opened this issue Jul 29, 2024 · 7 comments

Comments

@CHENAO-QIAN
Copy link

CHENAO-QIAN commented Jul 29, 2024

Hi,

I am trying the two methods: phmm and edlib. I found phmm has a high reproducibility, giving the same result between different runs. While edlib can give very different results. I am wondering if there is a recommendation regarding in which situation to use which method?

Thanks!

@CHENAO-QIAN CHENAO-QIAN changed the title Question about the two detection methods: phmm and edlib Question about the two detection methods Jul 29, 2024
@NanoCoreUSA
Copy link

I've have also noticed this recently, specifically in regard to how it handles UMIs with the '-U' flag. I've posted some examples below to illustrate (ran on a FASTQ file with one read, with the last two lines i.e., quality scores removed for clarity).

Here's starting input file (i.e., input to pychopper) -

@89afb5e5-88ea-4097-b1f1-8d84d48f1d0e runid=6578f753d630b7dedc05b3d42e9ab21a8b375a15 sampleid=secondBatch_1 read=51236 ch=329 start_time=2023-01-24T04:42:25Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637 barcode=barcode11
TTGTACTTCGGACATTCTGTTGGTGTTGCTCCGGATCGCCACGCCTACCGTGACGTCCTGCATCTATCGGAGGGAATGGATTTCTGTTGGTGCTGATATTGCTCCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCGCATCCCAACCCGCCGCCGCAGCCGCCTACAAACTGGTGCTGATCCGGCACGGCGAAGGCGCATGGAACCTGGAGAACCGCTTCAGCGGCTGGTACGACGCCCGGCCTGAACCAGCCGGACGAAATGAAGCGCGGCGAGGCGGCTACGAGATGCTGTTGCTGGTTGGCAACATCATTACTCGTTGAAGGCGATCCGGGCCCTCTGGACAGTGCTAGATGCCATTGATCAGATGTGGCTGCCAGTGGTGAGGACTTGGCGCCTCAATGAGCGGCACTATGGGGGTCTAACCGGTCTCAATAAAGCAGAAACTGCTGCAAAGCATGGTGAGGCCCAGGTGAAGATCTGGAGGCGCTCTATGATGTCCCACCACCTCCGATGGAGCCCGACCATCCTTTCTACAGCAACATCAGTAAGGATCGCGGGTATGCAGACCTCACAGAAGATCAGCTACCCTCCTGTGAGAGTCTGAAGGATACTATTGCCAGAGCTCTGCCCTTCTGGAATGAAGAAATGATTCCCCAGATCAAGGAGGGGAAACGTGTACTGATTGCAGCCCATGGCAAGCCTCCGGGGCATTGTCAAGCACTCGGAGGTCTCTCTGAAGGCTATCATGGAGCTGAACCTGCCGACTGGTATTCCCATTGTCTATGAATTGGGCAGGAACTTGAAGCCTGTCAAGCCCATGCAGTTCTGGGGGATGAAGAGACGGTGCCGCAAGCCATGGAAGCTGTGGCTGCCCAGGCAGAGCCAAGAAGTGAAGGCCGGCGGGGAGGATGCCTCCCCCAGGAGCACCCTCCCGTCTTGTCCTCATGCCCTCCCACCTGCACATGTCACACTGACCACATCTGTAGACATCTTGAGTTGTAGCTGCAGGCTGGGGGACCAGTGGCTCCCATTTTCATTTTAGCCATTTTGTCATGCCACCCACTCCCTTCATACAATCTAGTAAGAATAGCAGTTCTAGAGCACAGGTTCTCAGTCTAAGCTATGGAAAACACATATCCAACAGAGTTTAAAGTAGTGACTTGGGTTTTTGCGAGTGCTTTGTTTACTAAGGACTTTGGGGAGGAACCATGCTAAGCCATGACCAGTGAGGAGAAGCAACAGAGCCTGTCTGTCCCCATGAGCGGAGTCTGTCCTCTGCTCTTCTGCAGTCAGGTCACTGCCTACTGCCTGGGGGCTCTAGTCATTCCAGTGGAAGACGAATGTAACCTGCGTGGTGATGTGACAACTGTTTCCTCCTGACCCCAGAGGATCTGGCTCTAGGTTGGGATCAATCCTGAATTTCGTTATGTGTTAATTTACTTTTATTAAAAAAGTATAGTATATATAATACAAAACAATAACCCTTCTGGGGTTTCTTGTGGCAGATTTGAAATAGTCCCACATGTGGTCATCAGAAAATAAGCCATTCCTCATACCAATATAGGATCAGCTCCTTGACCTCTGAGGGGCAGGAGTGCTTCCTGGTGTGTGTATTAGAATCCCTTCCTGCCTTGTTTCATGGCAGTGAAATGCCTCTTGGTCCTGTCCAAGTGTATCTTTCACTGATTTCTGAATCATGTTTCTAGTTGCTTGACCCTGCCACATGGGTCCAGTGTTCATCTGAGCATAACTGTACTAAATCCTTTCCATATCAGTATAATAAAGGAGTGATGTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTTGCGGGCGGCGGACTCTCCTCTGAAGATAGAGCGGCAGGCAAGTTCCATTCCCTCTATAGATGAAACGTCA

Running the following pychopper command (i.e., using 'phmm' model) returns this -

pychopper -U -k PCS111 -m phmm barcode11_combinedReads.head.fastq pychopper.phmm.fastq

@132:1882|89afb5e5-88ea-4097-b1f1-8d84d48f1d0e runid=6578f753d630b7dedc05b3d42e9ab21a8b375a15 sampleid=secondBatch_1 read=51236 ch=329 start_time=2023-01-24T04:42:25Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637 barcode=barcode11 strand=+ umi=TTGCTCCCATTGGCATTGACCTTAAGCTTT
GTCGGTGCCGCATCCCAACCCGCCGCCGCAGCCGCCTACAAACTGGTGCTGATCCGGCACGGCGAAGGCGCATGGAACCTGGAGAACCGCTTCAGCGGCTGGTACGACGCCCGGCCTGAACCAGCCGGACGAAATGAAGCGCGGCGAGGCGGCTACGAGATGCTGTTGCTGGTTGGCAACATCATTACTCGTTGAAGGCGATCCGGGCCCTCTGGACAGTGCTAGATGCCATTGATCAGATGTGGCTGCCAGTGGTGAGGACTTGGCGCCTCAATGAGCGGCACTATGGGGGTCTAACCGGTCTCAATAAAGCAGAAACTGCTGCAAAGCATGGTGAGGCCCAGGTGAAGATCTGGAGGCGCTCTATGATGTCCCACCACCTCCGATGGAGCCCGACCATCCTTTCTACAGCAACATCAGTAAGGATCGCGGGTATGCAGACCTCACAGAAGATCAGCTACCCTCCTGTGAGAGTCTGAAGGATACTATTGCCAGAGCTCTGCCCTTCTGGAATGAAGAAATGATTCCCCAGATCAAGGAGGGGAAACGTGTACTGATTGCAGCCCATGGCAAGCCTCCGGGGCATTGTCAAGCACTCGGAGGTCTCTCTGAAGGCTATCATGGAGCTGAACCTGCCGACTGGTATTCCCATTGTCTATGAATTGGGCAGGAACTTGAAGCCTGTCAAGCCCATGCAGTTCTGGGGGATGAAGAGACGGTGCCGCAAGCCATGGAAGCTGTGGCTGCCCAGGCAGAGCCAAGAAGTGAAGGCCGGCGGGGAGGATGCCTCCCCCAGGAGCACCCTCCCGTCTTGTCCTCATGCCCTCCCACCTGCACATGTCACACTGACCACATCTGTAGACATCTTGAGTTGTAGCTGCAGGCTGGGGGACCAGTGGCTCCCATTTTCATTTTAGCCATTTTGTCATGCCACCCACTCCCTTCATACAATCTAGTAAGAATAGCAGTTCTAGAGCACAGGTTCTCAGTCTAAGCTATGGAAAACACATATCCAACAGAGTTTAAAGTAGTGACTTGGGTTTTTGCGAGTGCTTTGTTTACTAAGGACTTTGGGGAGGAACCATGCTAAGCCATGACCAGTGAGGAGAAGCAACAGAGCCTGTCTGTCCCCATGAGCGGAGTCTGTCCTCTGCTCTTCTGCAGTCAGGTCACTGCCTACTGCCTGGGGGCTCTAGTCATTCCAGTGGAAGACGAATGTAACCTGCGTGGTGATGTGACAACTGTTTCCTCCTGACCCCAGAGGATCTGGCTCTAGGTTGGGATCAATCCTGAATTTCGTTATGTGTTAATTTACTTTTATTAAAAAAGTATAGTATATATAATACAAAACAATAACCCTTCTGGGGTTTCTTGTGGCAGATTTGAAATAGTCCCACATGTGGTCATCAGAAAATAAGCCATTCCTCATACCAATATAGGATCAGCTCCTTGACCTCTGAGGGGCAGGAGTGCTTCCTGGTGTGTGTATTAGAATCCCTTCCTGCCTTGTTTCATGGCAGTGAAATGCCTCTTGGTCCTGTCCAAGTGTATCTTTCACTGATTTCTGAATCATGTTTCTAGTTGCTTGACCCTGCCACATGGGTCCAGTGTTCATCTGAGCATAACTGTACTAAATCCTTTCCATATCAGTATAATAAAGGAGTGATGTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This identifies the UMI as umi=TTGCTCCCATTGGCATTGACCTTAAGCTTT and the new sequence begins with GTCGGTGCCG. Refering back to the original input file, it seems the 'phmm' model is actually removing the UMI from the read (as well as few extra nucleotides, i.e., 'GGGG'). See here, where the first highlight is UMI and the second is where the new sequence begins -

TTGCTCCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCG

In contrast, running this command (i.e., the 'edlib' model) returns this -

pychopper -U -k PCS111 -m edlib barcode11_combinedReads.head.fastq pychopper.edlib.fastq

@103:1885|89afb5e5-88ea-4097-b1f1-8d84d48f1d0e runid=6578f753d630b7dedc05b3d42e9ab21a8b375a15 sampleid=secondBatch_1 read=51236 ch=329 start_time=2023-01-24T04:42:25Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637 barcode=barcode11 strand=+ umi=TTGCTCCCATTGGCATTGACCTTAAGCTTT
CCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCGCATCCCAACCCGCCGCCGCAGCCGCCTACAAACTGGTGCTGATCCGGCACGGCGAAGGCGCATGGAACCTGGAGAACCGCTTCAGCGGCTGGTACGACGCCCGGCCTGAACCAGCCGGACGAAATGAAGCGCGGCGAGGCGGCTACGAGATGCTGTTGCTGGTTGGCAACATCATTACTCGTTGAAGGCGATCCGGGCCCTCTGGACAGTGCTAGATGCCATTGATCAGATGTGGCTGCCAGTGGTGAGGACTTGGCGCCTCAATGAGCGGCACTATGGGGGTCTAACCGGTCTCAATAAAGCAGAAACTGCTGCAAAGCATGGTGAGGCCCAGGTGAAGATCTGGAGGCGCTCTATGATGTCCCACCACCTCCGATGGAGCCCGACCATCCTTTCTACAGCAACATCAGTAAGGATCGCGGGTATGCAGACCTCACAGAAGATCAGCTACCCTCCTGTGAGAGTCTGAAGGATACTATTGCCAGAGCTCTGCCCTTCTGGAATGAAGAAATGATTCCCCAGATCAAGGAGGGGAAACGTGTACTGATTGCAGCCCATGGCAAGCCTCCGGGGCATTGTCAAGCACTCGGAGGTCTCTCTGAAGGCTATCATGGAGCTGAACCTGCCGACTGGTATTCCCATTGTCTATGAATTGGGCAGGAACTTGAAGCCTGTCAAGCCCATGCAGTTCTGGGGGATGAAGAGACGGTGCCGCAAGCCATGGAAGCTGTGGCTGCCCAGGCAGAGCCAAGAAGTGAAGGCCGGCGGGGAGGATGCCTCCCCCAGGAGCACCCTCCCGTCTTGTCCTCATGCCCTCCCACCTGCACATGTCACACTGACCACATCTGTAGACATCTTGAGTTGTAGCTGCAGGCTGGGGGACCAGTGGCTCCCATTTTCATTTTAGCCATTTTGTCATGCCACCCACTCCCTTCATACAATCTAGTAAGAATAGCAGTTCTAGAGCACAGGTTCTCAGTCTAAGCTATGGAAAACACATATCCAACAGAGTTTAAAGTAGTGACTTGGGTTTTTGCGAGTGCTTTGTTTACTAAGGACTTTGGGGAGGAACCATGCTAAGCCATGACCAGTGAGGAGAAGCAACAGAGCCTGTCTGTCCCCATGAGCGGAGTCTGTCCTCTGCTCTTCTGCAGTCAGGTCACTGCCTACTGCCTGGGGGCTCTAGTCATTCCAGTGGAAGACGAATGTAACCTGCGTGGTGATGTGACAACTGTTTCCTCCTGACCCCAGAGGATCTGGCTCTAGGTTGGGATCAATCCTGAATTTCGTTATGTGTTAATTTACTTTTATTAAAAAAGTATAGTATATATAATACAAAACAATAACCCTTCTGGGGTTTCTTGTGGCAGATTTGAAATAGTCCCACATGTGGTCATCAGAAAATAAGCCATTCCTCATACCAATATAGGATCAGCTCCTTGACCTCTGAGGGGCAGGAGTGCTTCCTGGTGTGTGTATTAGAATCCCTTCCTGCCTTGTTTCATGGCAGTGAAATGCCTCTTGGTCCTGTCCAAGTGTATCTTTCACTGATTTCTGAATCATGTTTCTAGTTGCTTGACCCTGCCACATGGGTCCAGTGTTCATCTGAGCATAACTGTACTAAATCCTTTCCATATCAGTATAATAAAGGAGTGATGTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The identified UMI is the same, however now the new sequence still contains the UMI (albeit a truncated form missing the first five nucleotides) and also includes the 'GGGG' spacer the was trimmed from the 'phmm' model. See here from the first line of the pychopper output file -

CCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCG

So, the two models are definitely behaving differently, with the 'phmm' removing the UMI as well extra nucleotides, and the 'edlib' keeping the UMI in the sequence but with some nucleotides truncated.

To the original post point, it would be helpful to know what's going on here and what the recommendations are. At first glance it would seem the default 'phmm' model is ideal since the UMI is removed (though this does seem to refute #49 indicating the UMI is not trimmed...), but it's also removing extra sequence and unclear if that's supposed to be a part of the read or not.

Any feedback would be appreciated as we need this info for our pipeline development!

@nrhorner
Copy link
Contributor

@CHENAO-QIAN Could you give an example of the command you used with the edlib option and what you are finding that is different between the runs please?

@CHENAO-QIAN
Copy link
Author

Hi @nrhorner,

The cmd I used:
pychopper -k PCS111 \ -m edlib \ -t 50 \ -r ${id}_pychopper_report.pdf \ -u ${id}_pychopper.unclassified.fq \ -w ${id}_pychopper.rescued.fq \ -S ${id}_pychopper.stats \ -A ${id}_pychopper.scores \ $line ${id}_pychopped.fastq

I ran the above code twice and the outputs I am getting are different. I did a quick check of the number of lines of the two outputs:

wc -l sample_pychopped3.fastq
352090580 sample_pychopped3.fastq

wc -l sample_pychopper.rescued3.fq
15962440 sample_pychopper.rescued3.fq

wc -l sample_pychopper.unclassified3.fq
34285372 sample_pychopper.unclassified3.fq
wc -l sample_pychopped6.fastq
350023924 sample_pychopped6.fastq

wc -l sample_pychopper.rescued6.fq
15446508 sample_pychopper.rescued6.fq

wc -l sample_pychopper.unclassified6.fq
37104128 sample_pychopper.unclassified6.fq

@NanoCoreUSA
Copy link

@nrhorner Any update on this issue? We're trying to finalize our analysis pipeline and would like to confident in the model we use.

@nrhorner
Copy link
Contributor

Hi @CHENAO-QIAN

My apologies for taking so long to get round to this.

In the initial stage of pychopper, a primer alignment cutoff score is tuned for a value that returns the most full length reads. The step involves taking a random sample of the reads and testing it with a range of alignment score cutoff values. This randomness can lead to a different cuttoffs being selected on different identical runs. Look at the output from pychopper to see is this value is different between runs for you.

Best cutoff (q) value is 0.4138

The cutoff can be set up front with -q. If you try this, I expect the results will be identical for you between runs.

I think there should probably be a random seed, in the code, before the read sampling to make this reproducible.

@nrhorner
Copy link
Contributor

nrhorner commented Dec 2, 2024

Hi @CHENAO-QIAN

Looking at your original question again. The difference between edlib and hmm that you are seeing probably stem from the fact that edlib is being tunes for alignment score and phhm for E-value. It might be that a larger sample size (-Y) for tuning the cutoff might improve per run reproducibly. In terms of which method to choose, edlib is faster and phmm often has a higher recovery rate.

@nrhorner
Copy link
Contributor

nrhorner commented Dec 2, 2024

@NanoCoreUSA I think your issue is separate to the original post. I am looking into it now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants