Question about the two detection methods #65
Comments
I have also noticed this recently, specifically in regard to how it handles UMIs with the '-U' flag. I've posted some examples below to illustrate (run on a FASTQ file with one read, with the last two lines, i.e., the quality scores, removed for clarity). Here's the starting input file (i.e., the input to pychopper) -
Running the following pychopper command (i.e., using 'phmm' model) returns this -
This identifies the UMI as TTGCTCCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCG. In contrast, running this command (i.e., the 'edlib' model) returns this -
The identified UMI is the same; however, the new sequence still contains the UMI (albeit a truncated form missing the first five nucleotides) and also includes the 'GGGG' spacer that was trimmed by the 'phmm' model. See here from the first line of the pychopper output file - CCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCG So the two models are definitely behaving differently: 'phmm' removes the UMI as well as extra nucleotides, while 'edlib' keeps the UMI in the sequence but with some nucleotides truncated. To the original post's point, it would be helpful to know what's going on here and what the recommendations are. At first glance the default 'phmm' model seems ideal since the UMI is removed (though this does seem to contradict #49, which indicates the UMI is not trimmed...), but it's also removing extra sequence, and it's unclear whether that's supposed to be part of the read or not. Any feedback would be appreciated as we need this info for our pipeline development!
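As a minimal sanity check using only the sequences quoted above: the leading sequence retained by the 'edlib' run is exactly the reported UMI with its first five nucleotides removed.

```python
# UMI reported by both pychopper runs (copied from the output above)
umi = "TTGCTCCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCG"

# Leading sequence retained in the edlib run's output read
edlib_retained = "CCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCG"

# The retained sequence is the UMI minus its first five nucleotides ("TTGCT")
print(umi[5:] == edlib_retained)  # True
```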
@CHENAO-QIAN Could you give an example of the command you used with the edlib option and what you are finding that is different between the runs, please?
Hi @nrhorner, The cmd I used: I ran the above command twice and the outputs I am getting are different. I did a quick check of the number of lines of the two outputs:
@nrhorner Any update on this issue? We're trying to finalize our analysis pipeline and would like to be confident in the model we use.
Hi @CHENAO-QIAN My apologies for taking so long to get round to this. In the initial stage of pychopper, a primer alignment cutoff score is tuned to the value that returns the most full-length reads. This step involves taking a random sample of the reads and testing it against a range of alignment score cutoff values. This randomness can lead to different cutoffs being selected across otherwise identical runs. Look at the output from pychopper to see if this value differs between runs for you.
The cutoff can also be set up front. I think there should probably be a random seed in the code, before the read sampling, to make this reproducible.
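The behaviour described above can be sketched with a toy model (this is not pychopper's actual code): tuning a cutoff from an unseeded random subsample of reads can give a different answer on each run, while seeding the RNG before sampling makes the result deterministic.

```python
import random

def tune_cutoff(scores, sample_size=100, seed=None):
    """Toy stand-in for pychopper's cutoff tuning: derive a cutoff
    (here, simply the median) from a random subsample of alignment scores."""
    if seed is not None:
        random.seed(seed)  # seeding here makes the subsample reproducible
    sample = sorted(random.sample(scores, sample_size))
    return sample[len(sample) // 2]

# Simulated pool of per-read alignment scores
scores = [random.gauss(50, 10) for _ in range(10_000)]

# Unseeded: two runs draw different subsamples and may return different cutoffs
a, b = tune_cutoff(scores), tune_cutoff(scores)

# Seeded: two runs are identical
c, d = tune_cutoff(scores, seed=42), tune_cutoff(scores, seed=42)
print(c == d)  # True
```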
Hi @CHENAO-QIAN Looking at your original question again: the difference between edlib and phmm that you are seeing probably stems from the fact that edlib is tuned for alignment score and phmm for E-value. A larger sample size (-Y) for tuning the cutoff might improve per-run reproducibility. In terms of which method to choose, edlib is faster and phmm often has a higher recovery rate.
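A toy illustration (again, not pychopper's code) of why a larger tuning sample should help reproducibility: the spread of a cutoff estimated from a random subsample shrinks as the subsample grows.

```python
import random
import statistics

random.seed(0)  # seeded only so this demo itself is repeatable
scores = [random.gauss(50, 10) for _ in range(100_000)]

def tuned_cutoff(sample_size):
    """Estimate a cutoff (the median) from a random subsample of scores."""
    return statistics.median(random.sample(scores, sample_size))

# Repeat the tuning many times at each sample size and compare the spread:
# the larger subsample yields a much less variable cutoff estimate.
for n in (100, 10_000):
    estimates = [tuned_cutoff(n) for _ in range(50)]
    print(n, round(statistics.stdev(estimates), 3))
```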
@NanoCoreUSA I think your issue is separate from the original post. I am looking into it now.
Hi,
I am trying the two methods, phmm and edlib. I found that phmm has high reproducibility, giving the same result across different runs, while edlib can give very different results. I am wondering if there is a recommendation for which method to use in which situation?
Thanks!